Search Results: "Vincent Bernat"

3 May 2017

Vincent Bernat: VXLAN: BGP EVPN with Cumulus Quagga

VXLAN is an overlay network to encapsulate Ethernet traffic over an existing (highly available and scalable, possibly the Internet) IP network while accommodating a very large number of tenants. It is defined in RFC 7348. For an uncut introduction on its use with Linux, have a look at my "VXLAN & Linux" post.

[Figure: VXLAN deployment]

In the above example, we have hypervisors hosting virtual machines from different tenants. Each virtual machine is given access to a tenant-specific virtual Ethernet segment. Users are expecting classic Ethernet segments: no MAC restrictions¹, total control over the IP addressing scheme they use and availability of multicast. In a large VXLAN deployment, two aspects need attention:
  1. discovery of other endpoints (VTEPs) sharing the same VXLAN segments, and
  2. avoidance of BUM frames (broadcast, unknown unicast and multicast) as they have to be forwarded to all VTEPs.
A typical solution for the first point is multicast. For the second, it is source-address learning.

Introduction to BGP EVPN

BGP EVPN (RFC 7432 and draft-ietf-bess-evpn-overlay for its application to VXLAN) is a standard control protocol that efficiently solves those two aspects without relying on multicast or source-address learning. BGP EVPN relies on BGP (RFC 4271) and its MP-BGP extensions (RFC 4760). BGP is the routing protocol powering the Internet. It is highly scalable and interoperable. It is also extensible, and one of its extensions is MP-BGP. This extension can carry reachability information (NLRI) for multiple protocols (IPv4, IPv6, L3VPN and, in our case, EVPN). EVPN is a special family to advertise MAC addresses and the remote equipment they are attached to. There are basically two kinds of reachability information a VTEP sends through BGP EVPN:
  1. the VNIs they have interest in (type 3 routes), and
  2. for each VNI, the local MAC addresses (type 2 routes).
The protocol also covers other aspects of virtual Ethernet segments (L3 reachability information from ARP/ND caches, MAC mobility and multi-homing²) but we won't describe them here. To deploy BGP EVPN, a typical solution is to use several route reflectors (both for redundancy and scalability), like in the picture below. Each VTEP opens a BGP session to at least two route reflectors, sends its information (MACs and VNIs) and receives the others'. This reduces the number of BGP sessions to configure.

[Figure: VXLAN deployment with route reflectors]

Compared to other solutions to deploy VXLAN, BGP EVPN has three main advantages:
  • interoperability with other vendors (notably Juniper and Cisco),
  • proven scalability (a typical BGP router handles several millions of routes), and
  • possibility to enforce fine-grained policies.
On Linux, Cumulus Quagga is a fairly complete implementation of BGP EVPN (type 3 routes for VTEP discovery, type 2 routes with MAC or IP addresses, MAC mobility when a host moves from one VTEP to another) which requires very little configuration. It is a fork of Quagga, currently used in Cumulus Linux, a network operating system based on Debian powering switches from various brands. At some point, BGP EVPN support will be contributed back to FRR, a community-maintained fork of Quagga³. It should be noted that the BGP EVPN implementation of Cumulus Quagga currently only supports IPv4.

Route reflector setup

Before configuring each VTEP, we need to configure two or more route reflectors. There are many solutions. I will present three of them:
  • using Cumulus Quagga,
  • using GoBGP, an implementation of BGP in Go,
  • using Juniper JunOS.
For reliability purposes, it's possible (and easy) to use one implementation for some route reflectors and another implementation for the other ones. The proposed configurations are quite minimal. However, it is possible to centralize policies on the route reflectors (e.g. routes tagged with some community can only be readvertised to some group of VTEPs).

Using Quagga

The configuration is pretty simple. We suppose the configured route reflector has 203.0.113.254 configured as a loopback IP.
router bgp 65000
  bgp router-id 203.0.113.254
  bgp cluster-id 203.0.113.254
  bgp log-neighbor-changes
  no bgp default ipv4-unicast
  neighbor fabric peer-group
  neighbor fabric remote-as 65000
  neighbor fabric capability extended-nexthop
  neighbor fabric update-source 203.0.113.254
  bgp listen range 203.0.113.0/24 peer-group fabric
  !
  address-family evpn
   neighbor fabric activate
   neighbor fabric route-reflector-client
  exit-address-family
  !
  exit
!
A peer group named fabric is defined and we leverage the dynamic neighbor feature of Cumulus Quagga: we don't have to explicitly define each neighbor. Any client from 203.0.113.0/24 presenting itself as part of AS 65000 can connect. All EVPN routes it sends will be accepted and reflected to the other clients. You don't need to run Zebra, the route engine talking with the kernel. Instead, start bgpd with the --no_kernel flag.
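Assuming a Debian-style layout (the configuration path is hypothetical, adjust to your packaging), a standalone route reflector could then be started like this:

# bgpd --no_kernel -f /etc/quagga/bgpd.conf --daemon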

Using GoBGP

GoBGP is a clean implementation of BGP in Go⁴. It exposes an RPC API for configuration (but also accepts a configuration file and comes with a command-line client). It doesn't support dynamic neighbors, so you'll have to use the API, the command-line client or some templating language to automate their declaration. A configuration with only one neighbor looks like this:
global:
  config:
    as: 65000
    router-id: 203.0.113.254
    local-address-list:
      - 203.0.113.254
neighbors:
  - config:
      neighbor-address: 203.0.113.1
      peer-as: 65000
    afi-safis:
      - config:
          afi-safi-name: l2vpn-evpn
    route-reflector:
      config:
        route-reflector-client: true
        route-reflector-cluster-id: 203.0.113.254
More neighbors can be added from the command line:
$ gobgp neighbor add 203.0.113.2 as 65000 \
>         route-reflector-client 203.0.113.254 \
>         --address-family evpn
GoBGP won't try to interact with the kernel, which is fine for a route reflector.
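If you prefer the configuration file shown above over the API, gobgpd can load it at startup; since TOML is the default format, the type has to be stated explicitly (a sketch, assuming the file was saved as gobgpd.yaml):

$ gobgpd -f gobgpd.yaml --config-type yaml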

Using Juniper JunOS

A variety of Juniper products can be used as a BGP route reflector, notably the QFX5100 switch⁵, the MX240 router⁶ and the vRR virtual appliance⁷. The main factor is the CPU and the memory. The QFX5100 is low on memory and won't support large deployments without some additional policing. Here is a configuration similar to the Quagga one:
interfaces {
    lo0 {
        unit 0 {
            family inet {
                address 203.0.113.254/32;
            }
        }
    }
}
protocols {
    bgp {
        group fabric {
            family evpn {
                signaling {
                    /* Do not try to install EVPN routes */
                    no-install;
                }
            }
            type internal;
            cluster 203.0.113.254;
            local-address 203.0.113.254;
            allow 203.0.113.0/24;
        }
    }
}
routing-options {
    router-id 203.0.113.254;
    autonomous-system 65000;
}

VTEP setup

The next step is to configure each VTEP/hypervisor. Each VXLAN is locally configured using a bridge for local virtual interfaces, as illustrated in the schema below. The bridge takes care of the local MAC addresses (notably, using source-address learning) and the VXLAN interface takes care of the remote MAC addresses (received with BGP EVPN).

[Figure: Bridged VXLAN device]

VXLANs can be provisioned with the following script. Source-address learning is disabled as we will rely solely on BGP EVPN to synchronize FDBs between the hypervisors.
for vni in 100 200; do
    # Create VXLAN interface
    ip link add vxlan${vni} type vxlan \
        id ${vni} \
        dstport 4789 \
        local 203.0.113.2 \
        nolearning
    # Create companion bridge
    brctl addbr br${vni}
    brctl addif br${vni} vxlan${vni}
    brctl stp br${vni} off
    ip link set up dev br${vni}
    ip link set up dev vxlan${vni}
done
# Attach each VM to the appropriate segment
brctl addif br100 vnet10
brctl addif br100 vnet11
brctl addif br200 vnet12
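As a side note, on systems without brctl, the same plumbing can be done with ip link alone. Here is a possible equivalent for VNI 100, reusing the interface names from the script above (recent iproute2 assumed):

# ip link add br100 type bridge stp_state 0
# ip link set dev vxlan100 master br100
# ip link set dev vnet10 master br100
# ip link set up dev br100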
The configuration of Cumulus Quagga is similar to the one used for a route reflector, except we use the advertise-all-vni directive to publish all local VNIs.
router bgp 65000
  bgp router-id 203.0.113.2
  no bgp default ipv4-unicast
  neighbor fabric peer-group
  neighbor fabric remote-as 65000
  neighbor fabric capability extended-nexthop
  neighbor fabric update-source dummy0
  ! BGP sessions with route reflectors
  neighbor 203.0.113.253 peer-group fabric
  neighbor 203.0.113.254 peer-group fabric
  !
  address-family evpn
   neighbor fabric activate
   advertise-all-vni
  exit-address-family
  !
  exit
!
If everything works as expected, the instances sharing the same VNI should be able to ping each other. If IPv6 is enabled on the VMs, the ping command shows if everything is in order:
$ ping -c10 -w1 -t1 ff02::1%eth0
PING ff02::1%eth0(ff02::1%eth0) 56 data bytes
64 bytes from fe80::5254:33ff:fe00:8%eth0: icmp_seq=1 ttl=64 time=0.016 ms
64 bytes from fe80::5254:33ff:fe00:b%eth0: icmp_seq=1 ttl=64 time=4.98 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:9%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:a%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
--- ff02::1%eth0 ping statistics ---
1 packets transmitted, 1 received, +3 duplicates, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.016/3.745/4.991/2.152 ms

Verification

Step by step, let's check how everything comes together.

Getting VXLAN information from the kernel

On each VTEP, Quagga should be able to retrieve the information about configured VXLANs. This can be checked with vtysh:
# show interface vxlan100
Interface vxlan100 is up, line protocol is up
  Link ups:       1    last: 2017/04/29 20:01:33.43
  Link downs:     0    last: (never)
  PTM status: disabled
  vrf: Default-IP-Routing-Table
  index 11 metric 0 mtu 1500
  flags: <UP,BROADCAST,RUNNING,MULTICAST>
  Type: Ethernet
  HWaddr: 62:42:7a:86:44:01
  inet6 fe80::6042:7aff:fe86:4401/64
  Interface Type Vxlan
  VxLAN Id 100
  Access VLAN Id 1
  Master (bridge) ifindex 9 ifp 0x56536e3f3470
The important points are:
  • the VNI is 100, and
  • the bridge device was correctly detected.
Quagga should also be able to retrieve information about the local MAC addresses:
# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 2
MAC               Type   Intf/Remote VTEP      VLAN
50:54:33:00:00:0a local  eth1.100
50:54:33:00:00:0b local  eth2.100

BGP sessions

Each VTEP has to establish a BGP session to the route reflectors. On the VTEP, this can be checked by running vtysh:
# show bgp neighbors 203.0.113.254
BGP neighbor is 203.0.113.254, remote AS 65000, local AS 65000, internal link
 Member of peer-group fabric for session parameters
  BGP version 4, remote router ID 203.0.113.254
  BGP state = Established, up for 00:00:45
  Neighbor capabilities:
    4 Byte AS: advertised and received
    AddPath:
      L2VPN EVPN: RX advertised L2VPN EVPN
    Route refresh: advertised and received(new)
    Address family L2VPN EVPN: advertised and received
    Hostname Capability: advertised
    Graceful Restart Capabilty: advertised
[...]
 For address family: L2VPN EVPN
  fabric peer-group member
  Update group 1, subgroup 1
  Packet Queue length 0
  Community attribute sent to this neighbor(both)
  8 accepted prefixes

  Connections established 1; dropped 0
  Last reset never
Local host: 203.0.113.2, Local port: 37603
Foreign host: 203.0.113.254, Foreign port: 179
The output includes the following information:
  • the BGP state is Established,
  • the address family L2VPN EVPN is correctly advertised, and
  • 8 routes are received from this route reflector.
The state of the BGP sessions can also be checked from the route reflectors. With GoBGP, use the following command:
# gobgp neighbor 203.0.113.2
BGP neighbor is 203.0.113.2, remote AS 65000, route-reflector-client
  BGP version 4, remote router ID 203.0.113.2
  BGP state = established, up for 00:04:30
  BGP OutQ = 0, Flops = 0
  Hold time is 9, keepalive interval is 3 seconds
  Configured hold time is 90, keepalive interval is 30 seconds
  Neighbor capabilities:
    multiprotocol:
        l2vpn-evpn:     advertised and received
    route-refresh:      advertised and received
    graceful-restart:   received
    4-octet-as: advertised and received
    add-path:   received
    UnknownCapability(73):      received
    cisco-route-refresh:        received
[...]
  Route statistics:
    Advertised:             8
    Received:               5
    Accepted:               5
With JunOS, use the following command:
> show bgp neighbor 203.0.113.2
Peer: 203.0.113.2+38089 AS 65000 Local: 203.0.113.254+179 AS 65000
  Group: fabric                Routing-Instance: master
  Forwarding routing-instance: master
  Type: Internal    State: Established
  Last State: OpenConfirm   Last Event: RecvKeepAlive
  Last Error: None
  Options: <Preference LocalAddress Cluster AddressFamily Rib-group Refresh>
  Address families configured: evpn
  Local Address: 203.0.113.254 Holdtime: 90 Preference: 170
  NLRI evpn: NoInstallForwarding
  Number of flaps: 0
  Peer ID: 203.0.113.2     Local ID: 203.0.113.254     Active Holdtime: 9
  Keepalive Interval: 3          Group index: 0    Peer index: 2
  I/O Session Thread: bgpio-0 State: Enabled
  BFD: disabled, down
  NLRI for restart configured on peer: evpn
  NLRI advertised by peer: evpn
  NLRI for this session: evpn
  Peer supports Refresh capability (2)
  Stale routes from peer are kept for: 300
  Peer does not support Restarter functionality
  NLRI that restart is negotiated for: evpn
  NLRI of received end-of-rib markers: evpn
  NLRI of all end-of-rib markers sent: evpn
  Peer does not support LLGR Restarter or Receiver functionality
  Peer supports 4 byte AS extension (peer-as 65000)
  NLRI's for which peer can receive multiple paths: evpn
  Table bgp.evpn.0 Bit: 20000
    RIB State: BGP restart is complete
    RIB State: VPN restart is complete
    Send state: in sync
    Active prefixes:              5
    Received prefixes:            5
    Accepted prefixes:            5
    Suppressed due to damping:    0
    Advertised prefixes:          8
  Last traffic (seconds): Received 276  Sent 170  Checked 276
  Input messages:  Total 61     Updates 3       Refreshes 0     Octets 1470
  Output messages: Total 62     Updates 4       Refreshes 0     Octets 1775
  Output Queue[1]: 0            (bgp.evpn.0, evpn)
If a BGP session cannot be established, the logs of each BGP daemon should mention the cause.

Sent routes

From each VTEP, Quagga needs to send:
  • one type 3 route for each local VNI, and
  • one type 2 route for each local MAC address.
The best place to check the received routes is on one of the route reflectors. If you are using JunOS, the following command will display the received routes from the provided VTEP:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.2
bgp.evpn.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
  2:203.0.113.2:100::0::50:54:33:00:00:0a/304 MAC/IP
*                         203.0.113.2                  100        I
  2:203.0.113.2:100::0::50:54:33:00:00:0b/304 MAC/IP
*                         203.0.113.2                  100        I
  3:203.0.113.2:100::0::203.0.113.2/304 IM
*                         203.0.113.2                  100        I
  3:203.0.113.2:200::0::203.0.113.2/304 IM
*                         203.0.113.2                  100        I
There is one type 3 route for VNI 100 and another one for VNI 200. There are also two type 2 routes for two MAC addresses on VNI 100. To get more information, you can add the keyword extensive. Here is a type 3 route advertising 203.0.113.2 as a VTEP for VNI 100⁸:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.2 extensive
bgp.evpn.0: 11 destinations, 11 routes (11 active, 0 holddown, 0 hidden)
* 3:203.0.113.2:100::0::203.0.113.2/304 IM (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.2:100
     Nexthop: 203.0.113.2
     Localpref: 100
     AS path: I
     Communities: target:65000:268435556 encapsulation:vxlan(0x8)
[...]
Here is a type 2 route announcing the location of the 50:54:33:00:00:0a MAC address for VNI 100:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.2 extensive
bgp.evpn.0: 11 destinations, 11 routes (11 active, 0 holddown, 0 hidden)
* 2:203.0.113.2:100::0::50:54:33:00:00:0a/304 MAC/IP (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.2:100
     Route Label: 100
     ESI: 00:00:00:00:00:00:00:00:00:00
     Nexthop: 203.0.113.2
     Localpref: 100
     AS path: I
     Communities: target:65000:268435556 encapsulation:vxlan(0x8)
[...]
With Quagga, you can get a similar output with vtysh:
# show bgp evpn route
BGP table version is 0, local router ID is 203.0.113.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 203.0.113.2:100
*>i[2]:[0]:[0]:[48]:[50:54:33:00:00:0a]
                    203.0.113.2                   100      0 i
*>i[2]:[0]:[0]:[48]:[50:54:33:00:00:0b]
                    203.0.113.2                   100      0 i
*>i[3]:[0]:[32]:[203.0.113.2]
                    203.0.113.2                   100      0 i
Route Distinguisher: 203.0.113.2:200
*>i[3]:[0]:[32]:[203.0.113.2]
                    203.0.113.2                   100      0 i
[...]
With GoBGP, use the following command:
# gobgp global rib -a evpn
    Network  Next Hop             AS_PATH              Age        Attrs
*>  [type:macadv][rd:203.0.113.2:100][esi:single-homed][etag:0][mac:50:54:33:00:00:0a][ip:<nil>][labels:[100]]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435556] ]
*>  [type:macadv][rd:203.0.113.2:100][esi:single-homed][etag:0][mac:50:54:33:00:00:0b][ip:<nil>][labels:[100]]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435556] ]
*>  [type:macadv][rd:203.0.113.2:200][esi:single-homed][etag:0][mac:50:54:33:00:00:0a][ip:<nil>][labels:[200]]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435656] ]
*>  [type:multicast][rd:203.0.113.2:100][etag:0][ip:203.0.113.2]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435556] ]
*>  [type:multicast][rd:203.0.113.2:200][etag:0][ip:203.0.113.2]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435656] ]

Received routes

Each VTEP should have received the type 2 and type 3 routes from its fellow VTEPs, through the route reflectors. You can check with the show bgp evpn route command of vtysh. Does Quagga correctly understand the received routes? The type 3 routes are translated to an association between the remote VTEPs and the VNIs:
# show evpn vni
Number of VNIs: 2
VNI        VxLAN IF              VTEP IP         # MACs   # ARPs   Remote VTEPs
100        vxlan100              203.0.113.2     4        0        203.0.113.3
                                                                   203.0.113.1
200        vxlan200              203.0.113.2     3        0        203.0.113.3
                                                                   203.0.113.1
The type 2 routes are translated to an association between the remote MACs and the remote VTEPs:
# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 4
MAC               Type   Intf/Remote VTEP      VLAN
50:54:33:00:00:09 remote 203.0.113.1
50:54:33:00:00:0a local  eth1.100
50:54:33:00:00:0b local  eth2.100
50:54:33:00:00:0c remote 203.0.113.3

FDB configuration

The last step is to ensure Quagga has correctly provided the received information to the kernel. This can be checked with the bridge command:
# bridge fdb show dev vxlan100 | grep dst
00:00:00:00:00:00 dst 203.0.113.1 self permanent
00:00:00:00:00:00 dst 203.0.113.3 self permanent
50:54:33:00:00:0c dst 203.0.113.3 self
50:54:33:00:00:09 dst 203.0.113.1 self
All good! The first two lines are the translation of the type 3 routes (any BUM frame will be sent to both 203.0.113.1 and 203.0.113.3) and the last two are the translation of the type 2 routes.

Interoperability

One of the strengths of BGP EVPN is interoperability with other network vendors. To demonstrate it works as expected, we will configure a Juniper vMX to act as a VTEP. First, we need to configure the physical bridge⁹. This is similar to the use of ip link and brctl with Linux. We only configure one physical interface with two old-school VLANs paired with matching VNIs.
interfaces {
    ge-0/0/1 {
        unit 0 {
            family bridge {
                interface-mode trunk;
                vlan-id-list [ 100 200 ];
            }
        }
    }
}
routing-instances {
    switch {
        instance-type virtual-switch;
        interface ge-0/0/1.0;
        bridge-domains {
            vlan100 {
                domain-type bridge;
                vlan-id 100;
                vxlan {
                    vni 100;
                    ingress-node-replication;
                }
            }
            vlan200 {
                domain-type bridge;
                vlan-id 200;
                vxlan {
                    vni 200;
                    ingress-node-replication;
                }
            }
        }
    }
}
Then, we configure BGP EVPN to advertise all known VNIs. The configuration is quite similar to the one we did with Quagga:
protocols {
    bgp {
        group fabric {
            type internal;
            multihop;
            family evpn signaling;
            local-address 203.0.113.3;
            neighbor 203.0.113.253;
            neighbor 203.0.113.254;
        }
    }
}
routing-instances {
    switch {
        vtep-source-interface lo0.0;
        route-distinguisher 203.0.113.3:1; # ①
        vrf-import EVPN-VRF-VXLAN;
        vrf-target {
            target:65000:1;
            auto;
        }
        protocols {
            evpn {
                encapsulation vxlan;
                extended-vni-list all;
                multicast-mode ingress-replication;
            }
        }
    }
}
routing-options {
    router-id 203.0.113.3;
    autonomous-system 65000;
}
policy-options {
    policy-statement EVPN-VRF-VXLAN {
        then accept;
    }
}
We also need a small compatibility patch for Cumulus Quagga¹⁰. The routes sent by this configuration are very similar to the routes sent by Quagga. The main differences are:
  • on JunOS, the route distinguisher is configured statically (in ①), and
  • on JunOS, the VNI is also encoded as an Ethernet tag ID.
Here is a type 3 route, as sent by JunOS:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.3 extensive
bgp.evpn.0: 13 destinations, 13 routes (13 active, 0 holddown, 0 hidden)
* 3:203.0.113.3:1::100::203.0.113.3/304 IM (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.3:1
     Nexthop: 203.0.113.3
     Localpref: 100
     AS path: I
     Communities: target:65000:268435556 encapsulation:vxlan(0x8)
     PMSI: Flags 0x0: Label 6: Type INGRESS-REPLICATION 203.0.113.3
[...]
Here is a type 2 route:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.3 extensive
bgp.evpn.0: 13 destinations, 13 routes (13 active, 0 holddown, 0 hidden)
* 2:203.0.113.3:1::200::50:54:33:00:00:0f/304 MAC/IP (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.3:1
     Route Label: 200
     ESI: 00:00:00:00:00:00:00:00:00:00
     Nexthop: 203.0.113.3
     Localpref: 100
     AS path: I
     Communities: target:65000:268435656 encapsulation:vxlan(0x8)
[...]
We can check that the vMX is able to make sense of the routes it receives from its peers running Quagga:
> show evpn database l2-domain-id 100
Instance: switch
VLAN  DomainId  MAC address        Active source                  Timestamp        IP address
     100        50:54:33:00:00:0c  203.0.113.1                    Apr 30 12:46:20
     100        50:54:33:00:00:0d  203.0.113.2                    Apr 30 12:32:42
     100        50:54:33:00:00:0e  203.0.113.2                    Apr 30 12:46:20
     100        50:54:33:00:00:0f  ge-0/0/1.0                     Apr 30 12:45:55
On the other end, if we look at one of the Quagga-based VTEPs, we can check that the received routes are correctly understood:
# show evpn vni 100
VNI: 100
 VxLAN interface: vxlan100 ifIndex: 9 VTEP IP: 203.0.113.1
 Remote VTEPs for this VNI:
  203.0.113.3
  203.0.113.2
 Number of MACs (local and remote) known for this VNI: 4
 Number of ARPs (IPv4 and IPv6, local and remote) known for this VNI: 0
# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 4
MAC               Type   Intf/Remote VTEP      VLAN
50:54:33:00:00:0c local  eth1.100
50:54:33:00:00:0d remote 203.0.113.2
50:54:33:00:00:0e remote 203.0.113.2
50:54:33:00:00:0f remote 203.0.113.3
Get in touch if you have some success with other vendors!

  1. For example, they may use bridges to connect containers together.
  2. Such a feature can replace proprietary implementations of MC-LAG, allowing several VTEPs to act as an endpoint for a single link aggregation group. This is not needed in our scenario, where hypervisors act as VTEPs.
  3. The development of Quagga is slow and "closed". New features are often stalled. FRR is placed under the umbrella of the Linux Foundation, has a GitHub-centered development model and an election process. It already has several interesting enhancements (notably BGP add-path, BGP unnumbered, MPLS and LDP).
  4. I am unenthusiastic about projects whose sole purpose is to rewrite something in Go. However, while quite young, GoBGP is quite valuable on its own (good architecture, good performance).
  5. The 48-port version is around $10,000 with the BGP license.
  6. An empty chassis with a dual routing engine (RE-S-1800X4-16G) is around $30,000.
  7. I don't know how pricey the vRR is. For evaluation purposes, it can be downloaded for free if you are a customer.
  8. The value 100 used in the route distinguisher (203.0.113.2:100) is not the one used to encode the VNI. The VNI is encoded in the route target (65000:268435556), in the 24 least significant bits (268435556 & 0xffffff equals 100). As long as VNIs are unique, we don't have to understand those details.
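    The masking is easy to check from a shell:
    $ printf '%d\n' $(( 268435556 & 0xffffff ))
    100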
  9. For some reason, the use of a virtual switch is mandatory. This is specific to this platform: a QFX doesn't require this.
  10. The encoding of the VNI into the route target is being standardized in draft-ietf-bess-evpn-overlay. Juniper already implements this draft.

Vincent Bernat: VXLAN & Linux

VXLAN is an overlay network to carry Ethernet traffic over an existing (highly available and scalable) IP network while accommodating a very large number of tenants. It is defined in RFC 7348. Starting from Linux 3.12, the VXLAN implementation is quite complete as both multicast and unicast are supported, as well as IPv6 and IPv4. Let's explore the various methods to configure it.

VXLAN setup

To illustrate our examples, we use the following setup: a VXLAN tunnel extends the individual Ethernet segments across the three bridges, providing a unique (virtual) Ethernet segment. From one host (e.g. H1), we can directly reach all the other hosts in the virtual segment:
$ ping -c10 -w1 -t1 ff02::1%eth0
PING ff02::1%eth0(ff02::1%eth0) 56 data bytes
64 bytes from fe80::5254:33ff:fe00:8%eth0: icmp_seq=1 ttl=64 time=0.016 ms
64 bytes from fe80::5254:33ff:fe00:b%eth0: icmp_seq=1 ttl=64 time=4.98 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:9%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:a%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
--- ff02::1%eth0 ping statistics ---
1 packets transmitted, 1 received, +3 duplicates, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.016/3.745/4.991/2.152 ms

Basic usage

The reference deployment for VXLAN is to use an IP multicast group to join the other VTEPs:
# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   group ff05::100 \
>   dev eth0 \
>   ttl 5
# brctl addbr br100
# brctl addif br100 vxlan100
# brctl addif br100 vnet22
# brctl addif br100 vnet25
# brctl stp br100 off
# ip link set up dev br100
# ip link set up dev vxlan100
The above commands create a new interface acting as a VXLAN tunnel endpoint, named vxlan100, and put it in a bridge with some regular interfaces¹. Each VXLAN segment is associated with a 24-bit segment ID, the VXLAN Network Identifier (VNI). In our example, the default VNI is specified with id 100.

When VXLAN was first implemented in Linux 3.7, the UDP port to use was not defined. Several vendors were using 8472 and Linux took the same value. To avoid breaking existing deployments, this is still the default value. Therefore, if you want to use the IANA-assigned port, you need to explicitly set it with dstport 4789.

As we want to use multicast, we have to specify a multicast group to join (group ff05::100), as well as a physical device (dev eth0). With multicast, the default TTL is 1. If your multicast network leverages some routing, you'll have to increase the value a bit, like here with ttl 5.

The vxlan100 device acts as a bridge device with remote VTEPs as virtual ports:
  • it sends broadcast, unknown unicast and multicast (BUM) frames to all VTEPs using the multicast group, and
  • it discovers the association from Ethernet MAC addresses to VTEP IP addresses using source-address learning.
The following figure summarizes the configuration, with the FDB of the Linux bridge (learning local MAC addresses) and the FDB of the VXLAN device (learning distant MAC addresses).

[Figure: Bridged VXLAN device]

The FDB of the VXLAN device can be observed with the bridge command. If the destination MAC is present, the frame is sent to the associated VTEP (unicast). The all-zero address is only used when a lookup for the destination MAC fails.
# bridge fdb show dev vxlan100 | grep dst
00:00:00:00:00:00 dst ff05::100 via eth0 self permanent
50:54:33:00:00:0b dst 2001:db8:3::1 self
50:54:33:00:00:08 dst 2001:db8:1::1 self
If you are interested in more details on how to set up a multicast network and build VXLAN segments on top of it, see my "Network virtualization with VXLAN" article.

Without multicast

Using VXLAN over a multicast IP network has several benefits:
  • automatic discovery of other VTEPs sharing the same multicast group,
  • good bandwidth usage (packets are replicated as late as possible),
  • decentralized and controller-less design².
However, multicast is not available everywhere and managing it at scale can be difficult. In Linux 3.8, the DOVE extensions have been added to the VXLAN implementation, removing the dependency on multicast.

Unicast with static flooding

We can replace multicast by head-end replication of BUM frames to a statically configured list of remote VTEPs³:
# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:3::1
The VXLAN device is defined without a remote multicast group. Instead, all the remote VTEPs are associated with the all-zero address: a BUM frame will be duplicated to all those destinations. The VXLAN device will still learn remote addresses automatically using source-address learning. It is a very simple solution. With a bit of automation, you can keep the default FDB entries up-to-date easily, as in the sketch below. However, the host will have to duplicate each BUM frame (head-end replication) as many times as there are remote VTEPs. This is quite reasonable if you have a dozen of them, but may become out of hand if you have thousands. The Cumulus vxfld daemon is an example of this strategy (in its head-end replication mode).
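As a sketch of such automation, assuming a file /etc/vxlan/vteps (a hypothetical path) listing one remote VTEP address per line, the default entries could be populated with:

# while read vtep; do
>     bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst $vtep
> done < /etc/vxlan/vteps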

Unicast with static L2 entries

When the associations of MAC addresses and VTEPs are known, it is possible to pre-populate the FDB and disable learning:
# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   nolearning
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:3::1
# bridge fdb append 50:54:33:00:00:09 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0a dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0b dev vxlan100 dst 2001:db8:3::1
Thanks to the nolearning flag, source-address learning is disabled. Therefore, if a MAC is missing, the frame will always be sent using the all-zero entries. The all-zero entries are still needed for broadcast and multicast traffic (e.g. ARP and IPv6 neighbor discovery). This kind of setup works well to provide virtual L2 networks to virtual machines (no L3 information available). You need some glue to update the FDB entries. BGP EVPN with Cumulus Quagga is an example of use of this strategy (see VXLAN: BGP EVPN with Cumulus Quagga for additional information).

Unicast with static L3 entries

In the previous example, we had to keep the all-zero entries for ARP and IPv6 neighbor discovery to work correctly. However, Linux can answer neighbor requests on behalf of the remote nodes⁴. When this feature is enabled, the default entries are not needed anymore (but you could keep them):
# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   nolearning \
>   proxy
# ip -6 neigh add 2001:db8:ff::11 lladdr 50:54:33:00:00:09 dev vxlan100
# ip -6 neigh add 2001:db8:ff::12 lladdr 50:54:33:00:00:0a dev vxlan100
# ip -6 neigh add 2001:db8:ff::13 lladdr 50:54:33:00:00:0b dev vxlan100
# bridge fdb append 50:54:33:00:00:09 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0a dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0b dev vxlan100 dst 2001:db8:3::1
This setup totally eliminates head-end replication. However, protocols relying on multicast won't work either. With some automation, this is a setup that should work well with containers: if there is a registry keeping a list of all IP and MAC addresses in use, a program could listen to it and adjust the FDB and the neighbor tables, as sketched below. The VXLAN backend of Docker's libnetwork is an example of this strategy (but it also uses the next method).
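As a sketch of such glue, assuming the registry can be dumped to a file with one "IP MAC VTEP" triplet per line (both the format and the path are hypothetical), the tables could be reconciled with:

# while read ip mac vtep; do
>     ip -6 neigh replace $ip lladdr $mac dev vxlan100
>     bridge fdb replace $mac dev vxlan100 dst $vtep
> done < /var/lib/registry/vxlan100.dump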

Unicast with dynamic L3 entries

Linux can also notify a program when an (L2 or L3) entry is missing. The program can then query some central registry and dynamically add the requested entry. However, for L2 entries, notifications are issued only if:
  • the destination MAC address is not known,
  • there is no all-zero entry in the FDB, and
  • the destination MAC address is not a multicast or broadcast one.
Those limitations prevent us from doing a "unicast with dynamic L2 entries" scenario. First, let's create the VXLAN device with the l2miss and l3miss options⁵:
ip -6 link add vxlan100 type vxlan \
   id 100 \
   dstport 4789 \
   local 2001:db8:1::1 \
   nolearning \
   l2miss \
   l3miss \
   proxy
Notifications are sent to programs listening to an AF_NETLINK socket using the NETLINK_ROUTE protocol. This socket needs to be bound to the RTNLGRP_NEIGH group. The following command does exactly that and decodes the received notifications:
# ip monitor neigh dev vxlan100
miss 2001:db8:ff::12 STALE
miss lladdr 50:54:33:00:00:0a STALE
The first notification is about a missing neighbor entry for the requested IP address. We can add it with the following command:
ip -6 neigh replace 2001:db8:ff::12 \
    lladdr 50:54:33:00:00:0a \
    dev vxlan100 \
    nud reachable
The entry is not permanent, so that we don't need to delete it when it expires. If the address becomes stale, we will get another notification to refresh it. Once the host receives our proxy answer for the neighbor discovery request, it can send a frame with the MAC we gave as destination. The second notification is about the missing FDB entry for this MAC address. We add the appropriate entry with the following command⁶:
bridge fdb replace 50:54:33:00:00:0a \
    dst 2001:db8:2::1 \
    dev vxlan100 dynamic
The entry is not permanent either, as it would prevent the MAC from migrating to the local VTEP (a dynamic entry cannot override a permanent entry). This setup works well with containers and a global registry; a minimal glue program is sketched below. However, there is a small latency penalty for the first connections. Moreover, multicast and broadcast won't be available in the overlay network. The VXLAN backend for flannel, a network fabric for Kubernetes, is an example of this strategy.
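Putting the pieces together, here is a minimal sketch of such a glue program; lookup-registry is a hypothetical helper querying the central registry and printing the matching MAC or VTEP address:

# ip monitor neigh dev vxlan100 | while read -r notification; do
>     case $notification in
>         "miss lladdr"*)
>             # L2 miss: resolve the MAC to a VTEP, update the FDB
>             mac=$(echo $notification | cut -d" " -f3)
>             bridge fdb replace $mac \
>                 dst $(lookup-registry mac $mac) \
>                 dev vxlan100 dynamic;;
>         miss*)
>             # L3 miss: resolve the IP to a MAC, update the neighbor table
>             ip=$(echo $notification | cut -d" " -f2)
>             ip -6 neigh replace $ip \
>                 lladdr $(lookup-registry ip $ip) \
>                 dev vxlan100 nud reachable;;
>     esac
> done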

Decision

There is no one-size-fits-all solution. You should consider the multicast solution if:
  • you are in an environment where multicast is available,
  • you are ready to operate (and scale) a multicast network,
  • you need multicast and broadcast inside the virtual segments,
  • you don t have L2/L3 addresses available beforehand.
The scalability of such a solution is pretty good if you take care not to put all VXLAN interfaces into the same multicast group (e.g. use the last byte of the VNI as the last byte of the multicast group).

When multicast is not available, another generic solution is BGP EVPN: BGP is used as a controller to ensure distribution of the list of VTEPs and their respective FDBs. As mentioned earlier, an implementation of this solution is Cumulus Quagga. I explore this option in a separate post: VXLAN: BGP EVPN with Cumulus Quagga.

If you operate in a container-like environment where L2/L3 addresses are known beforehand, a solution using static and/or dynamic L2 and L3 entries, based on a central registry and without source-address learning, would also fit the bill. This provides a tighter security model (bounded resources, MiTM attacks dampened down, no ability to amplify bandwidth usage through excessive broadcast). Various environment-specific solutions are available⁷, or you can build your own.

Other considerations

Independently of the chosen strategy, here are a few important points to keep in mind when implementing a VXLAN overlay.

Isolation

While you may expect VXLAN interfaces to only carry L2 traffic, Linux doesn't disable IP processing. If the destination MAC is a local one, Linux will route or deliver the encapsulated IP packet. Check my post about the proper isolation of a Linux bridge.

Encryption

VXLAN enforces isolation between tenants, but the traffic is totally unencrypted. The most direct solution to provide encryption is to use IPsec. Some container-based solutions may come with IPsec support out of the box (notably Docker's libnetwork, but flannel has plans for it too). This is quite important for a deployment over a public cloud.

Overhead

The format of a VXLAN-encapsulated frame is shown in the figure below.

[Figure: VXLAN encapsulation]

VXLAN adds a fixed overhead of 50 bytes. If you also use IPsec, the overhead depends on many factors. In transport mode, with AES and SHA256, the overhead is 56 bytes. With NAT traversal, this is 64 bytes (additional UDP header). In tunnel mode, this is 72 bytes. See the Cisco IPsec Overhead Calculator Tool. Some users will expect to be able to use an Ethernet MTU of 1500 for the overlay network. Therefore, the underlay MTU should be increased. If that is not possible, ensure the inner MTU (inside the containers or the virtual machines) is correctly decreased⁸.
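For example, with an IPv4 underlay whose MTU is 1500, the inner MTU should be lowered to 1450 (1500 minus the 50-byte overhead); an IPv6 underlay consumes 20 more bytes, giving 1430. Inside a virtual machine or container, this boils down to something like:

# ip link set mtu 1450 dev eth0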

IPv6

While all the examples above use IPv6, the ecosystem is not quite ready yet. The multicast L2-only strategy works fine with IPv6, but every other scenario currently needs some patches (1, 2, 3). On top of that, IPv6 may not have been implemented yet in VXLAN-related tools.

Multicast

The Linux VXLAN implementation doesn't support IGMP snooping. Multicast traffic will be broadcast to all VTEPs unless multicast MAC addresses are inserted into the FDB, as shown below.
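For example, reusing the addresses from the previous sections, the MAC associated with the IPv6 all-nodes group (ff02::1) could be pinned to a single remote VTEP like this:

# bridge fdb append 33:33:00:00:00:01 dev vxlan100 dst 2001:db8:2::1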

  1. This is one possible implementation. The bridge is only needed if you require some form of source-address learning for local interfaces. Another strategy is to use MACVLAN interfaces.
  2. The underlay multicast network may still need some central components, like rendezvous points for the PIM-SM protocol. Fortunately, it's possible to make them highly available and scalable (e.g. with Anycast-RP, RFC 4610).
  3. For this example and the following ones, a patch is needed for the ip command (to be included in 4.11) to use IPv6 for transport. In the meantime, here is a quick workaround:
    # ip -6 link add vxlan100 type vxlan \
    >   id 100 \
    >   dstport 4789 \
    >   local 2001:db8:1::1 \
    >   remote 2001:db8:2::1
    # bridge fdb append 00:00:00:00:00:00 \
    >   dev vxlan100 dst 2001:db8:3::1
  4. You may have to apply an IPv6-related patch to the kernel (to be included in 4.12).
  5. You have to apply an IPv6-related patch to the kernel (to be included in 4.12) to get appropriate notifications for missing IPv6 addresses.
  6. Directly adding the entry after the first notification would have been smarter to avoid unnecessary retransmissions.
  7. flannel and Docker's libnetwork were already mentioned as they both feature a VXLAN backend. There are also some interesting experiments like BaGPipe BGP for Kubernetes, which leverages BGP EVPN and is therefore interoperable with other vendors.
  8. There is no such thing as MTU discovery on an Ethernet segment.

12 April 2017

Vincent Bernat: Proper isolation of a Linux bridge

TL;DR: when configuring a Linux bridge, use the following commands to enforce isolation:
# bridge vlan del dev br0 vid 1 self
# echo 1 > /sys/class/net/br0/bridge/vlan_filtering

A network bridge (also commonly called a "switch") brings several Ethernet segments together. It is a common element in most infrastructures. Linux provides its own implementation. A typical use of a Linux bridge is shown below. The hypervisor is running three virtual hosts. Each virtual host is attached to the br0 bridge (represented by the horizontal segment). The hypervisor has two physical network interfaces: one attached to a public network and one attached to an infrastructure network.

[Figure: Typical use of Linux bridging with virtual machines]

The main expectation of such a setup is that while the virtual hosts should be able to use resources from the public network, they should not be able to access resources from the infrastructure network (including resources hosted on the hypervisor itself, like an SSH server). In other words, we expect a total isolation between the green domain and the purple one. That's not the case. From any virtual host:
# ip route add 192.168.14.3/32 dev eth0
# ping -c 3 192.168.14.3
PING 192.168.14.3 (192.168.14.3) 56(84) bytes of data.
64 bytes from 192.168.14.3: icmp_seq=1 ttl=59 time=0.644 ms
64 bytes from 192.168.14.3: icmp_seq=2 ttl=59 time=0.829 ms
64 bytes from 192.168.14.3: icmp_seq=3 ttl=59 time=0.894 ms
--- 192.168.14.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2033ms
rtt min/avg/max/mdev = 0.644/0.789/0.894/0.105 ms

Why?

There are two main factors behind this behavior:
  1. A bridge can accept IP traffic. This is a useful feature if you want Linux to act as a bridge and provide some IP services to bridge users (a DHCP relay or a default gateway). This is usually done by configuring the IP address on the bridge device: ip addr add 192.0.2.2/25 dev br0.
  2. An interface doesn't need an IP address to process incoming IP traffic. Additionally, by default, Linux answers ARP requests independently of the incoming interface.

Bridge processing

After turning an incoming Ethernet frame into a socket buffer, the network driver transfers the buffer to the netif_receive_skb() function. The following actions are executed:
  1. copy the frame to any registered global or per-device taps (e.g. tcpdump),
  2. evaluate the ingress policy (configured with tc),
  3. hand over the frame to the device-specific receive handler, if any,
  4. hand over the frame to a global or device-specific protocol handler (e.g. IPv4, ARP, IPv6).
For a bridged interface, the kernel has configured a device-specific receive handler, br_handle_frame(). This function won't allow any additional processing in the context of the incoming interface, except for STP and LLDP frames or if "brouting" is enabled¹. Therefore, the protocol handlers are never executed in this case. After a few additional checks, Linux will decide if the frame has to be locally delivered:
  • the entry for the target MAC in the FDB is marked for local delivery, or
  • the target MAC is a broadcast or a multicast address.
In this case, the frame is passed to the br_pass_frame_up() function. A VLAN-related check is optionally performed. The socket buffer is attached to the bridge interface (br0) instead of the physical interface (eth0), is evaluated by Netfilter and sent back to netif_receive_skb(). It will go through the four steps a second time.

IPv4 processing

When a device doesn't have a protocol-independent receive handler, a protocol-specific handler will be used:
# cat /proc/net/ptype
Type Device      Function
0800          ip_rcv
0011          llc_rcv [llc]
0004          llc_rcv [llc]
0806          arp_rcv
86dd          ipv6_rcv
Therefore, if the Ethernet type of the incoming frame is 0x800, the socket buffer is handled by ip_rcv(). Among other things, the three following steps will happen:
  • If the frame destination address is not the MAC address of the incoming interface, not a multicast one and not a broadcast one, the frame is dropped ("not for us").
  • Netfilter gets a chance to evaluate the packet (in a PREROUTING chain).
  • The routing subsystem will decide the destination of the packet in ip_route_input_slow(): is it a local packet, should it be forwarded, should it be dropped, should it be encapsulated? Notably, the reverse-path filtering is done during this evaluation in fib_validate_source().
Reverse-path filtering (also known as uRPF, or unicast reverse-path forwarding, RFC 3704) enables Linux to reject traffic on interfaces which it should never have originated: the source address is looked up in the routing tables and if the outgoing interface is different from the current incoming one, the packet is rejected.

ARP processing

When the Ethernet type of the incoming frame is 0x806, the socket buffer is handled by arp_rcv().
  • Like for IPv4, if the frame is not for us, it is dropped.
  • If the incoming device has the NOARP flag, the frame is dropped.
  • Netfilter gets a chance to evaluate the packet (configuration is done with arptables).
  • For an ARP request, the values of arp_ignore and arp_filter may trigger a drop of the packet.

IPv6 processing

When the Ethernet type of the incoming frame is 0x86dd, the socket buffer is handled by ipv6_rcv().
  • Like for IPv4, if the frame is not for us, it is dropped.
  • If IPv6 is disabled on the interface, the packet is dropped.
  • Netfilter gets a chance to evaluate the packet (in a PREROUTING chain).
  • The routing subsystem will decide the destination of the packet. However, unlike IPv4, there is no reverse-path filtering².

Workarounds

There are various methods to fix the situation. We can completely ignore the bridged interfaces: as long as they are attached to the bridge, they cannot process any upper layer protocol (IPv4, IPv6, ARP). Therefore, we can focus on filtering incoming traffic from br0. It should be noted that for the IPv4, IPv6 and ARP protocols, the MAC address check can be circumvented by using the broadcast MAC address.

Protocol-independent workarounds

The four following fixes will indistinctly drop IPv4, ARP and IPv6 packets.

Using VLAN-aware bridge

Linux 3.9 introduced the ability to use VLAN filtering on bridge ports. This can be used to prevent any local traffic:
# echo 1 > /sys/class/net/br0/bridge/vlan_filtering
# bridge vlan del dev br0 vid 1 self
# bridge vlan show
port    vlan ids
eth0     1 PVID Egress Untagged
eth2     1 PVID Egress Untagged
eth3     1 PVID Egress Untagged
eth4     1 PVID Egress Untagged
br0     None
This is the most efficient method since the frame is dropped directly in br_pass_frame_up().

Using ingress policy

It's also possible to drop the bridged frame early after it has been re-delivered to netif_receive_skb() by br_pass_frame_up(). The ingress policy of an interface is evaluated before any handler. Therefore, the following commands will ensure no local delivery happens (the source interface of the packet being the bridge interface):
# tc qdisc add dev br0 handle ffff: ingress
# tc filter add dev br0 parent ffff: u32 match u8 0 0 action drop
In my opinion, this is the second most efficient method.

Using ebtables

Just before re-delivering the frame to netif_receive_skb(), Netfilter gets a chance to issue a decision. It's easy to configure it to drop the frame:
# ebtables -A INPUT --logical-in br0 -j DROP
However, to the best of my knowledge, this part of Netfilter is known to be inefficient.

Using namespaces

Isolation can also be obtained by moving all the bridged interfaces into a dedicated network namespace and configuring the bridge inside this namespace:
# ip netns add bridge0
# ip link set netns bridge0 eth0
# ip link set netns bridge0 eth2
# ip link set netns bridge0 eth3
# ip link set netns bridge0 eth4
# ip link del dev br0
# ip netns exec bridge0 brctl addbr br0
# for i in 0 2 3 4; do
>    ip netns exec bridge0 brctl addif br0 eth$i
>    ip netns exec bridge0 ip link set up dev eth$i
> done
# ip netns exec bridge0 ip link set up dev br0
The frame will still wander a bit inside the IP stack, wasting some CPU cycles and increasing the possible attack surface. But ultimately, it will be dropped.

Protocol-dependent workarounds

Unless you require multiple layers of security, if one of the previous workarounds is already applied, there is no need to apply one of the protocol-dependent fixes below. It's still interesting to know them because it is not uncommon to already have them in place.

ARP

The easiest way to disable ARP processing on a bridge is to set the NOARP flag on the device. The ARP packet will be dropped as the very first step of the ARP handler.
# ip link set arp off dev br0
# ip l l dev br0
8: br0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 50:54:33:00:00:04 brd ff:ff:ff:ff:ff:ff
arptables can also drop the packet quite early:
# arptables -A INPUT -i br0 -j DROP
Another way is to set arp_ignore to 2 for the given interface. The kernel will only answer ARP requests whose target IP address is configured on the incoming interface. Since the bridge interface doesn't have any IP address, no ARP requests will be answered.
# sysctl -qw net.ipv4.conf.br0.arp_ignore=2
Disabling ARP processing is not a sufficient workaround for IPv4. A user can still insert the appropriate entry in their neighbor cache:
# ip neigh replace 192.168.14.3 lladdr 50:54:33:00:00:04 dev eth0
# ping -c 1 192.168.14.3
PING 192.168.14.3 (192.168.14.3) 56(84) bytes of data.
64 bytes from 192.168.14.3: icmp_seq=1 ttl=49 time=1.30 ms
--- 192.168.14.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.309/1.309/1.309/0.000 ms
As the check on the target MAC address is quite loose, they don't even need to guess the MAC address:
# ip neigh replace 192.168.14.3 lladdr ff:ff:ff:ff:ff:ff dev eth0
# ping -c 1 192.168.14.3
PING 192.168.14.3 (192.168.14.3) 56(84) bytes of data.
64 bytes from 192.168.14.3: icmp_seq=1 ttl=49 time=1.12 ms
--- 192.168.14.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.129/1.129/1.129/0.000 ms

IPv4

The earliest place to drop an IPv4 packet is with Netfilter³:
# iptables -t raw -I PREROUTING -i br0 -j DROP
If Netfilter is disabled, another possibility is to enable strict reverse-path filtering for the interface. In this case, since there is no IP address configured on the interface, the packet will be dropped during the route lookup:
# sysctl -qw net.ipv4.conf.br0.rp_filter=1
Another option is the use of a dedicated routing rule. Compared to the reverse-path filtering option, the packet will be dropped a bit earlier, still during the route lookup.
# ip rule add iif br0 blackhole

IPv6

Linux provides a way to completely disable IPv6 on a given interface. The packet will be dropped as the very first step of the IPv6 handler:
# sysctl -qw net.ipv6.conf.br0.disable_ipv6=1
Like for IPv4, it's possible to use Netfilter or a dedicated routing rule.

About the example

In the above example, the virtual host gets ICMP replies because they are routed through the infrastructure network to the Internet (e.g. the hypervisor has a default gateway which also acts as a NAT router to the Internet). This may not be the case. If you want to check whether you are vulnerable despite not getting an ICMP reply, look at the guest neighbor table to check if you got an ARP reply from the host:
# ip route add 192.168.14.3/32 dev eth0
# ip neigh show dev eth0
192.168.14.3 lladdr 50:54:33:00:00:04 REACHABLE
If you didn t get a reply, you could still have issues with IP processing. Add a static neighbor entry before checking the next step:
# ip neigh replace 192.168.14.3 lladdr ff:ff:ff:ff:ff:ff dev eth0
To check if IP processing is enabled, check the bridge host s network statistics:
# netstat -s | grep "ICMP messages"
    15 ICMP messages received
    15 ICMP messages sent
    0 ICMP messages failed
If the counters are increasing, it is processing incoming IP packets. One-way communication still allows a lot of bad things, like DoS attacks. Additionally, if the hypervisor happens to also act as a router, the reach is extended to the whole infrastructure network, potentially exposing weak devices (e.g. a PDU exposing an SNMP agent). If one-way communication is all that's needed, the attacker can also spoof its source IP address, bypassing IP-based authentication.

  1. A frame can be forcibly routed (L3) instead of bridged (L2) by "brouting" the packet. This action can be triggered using ebtables.
  2. For IPv6, reverse-path filtering needs to be implemented with Netfilter, using the rpfilter match.
  3. If the br_netfilter module is loaded, the net.bridge.bridge-nf-call-iptables sysctl has to be set to 0. Otherwise, you also need to use the physdev match to not drop IPv4 packets going through the bridge.

5 March 2017

Vincent Bernat: Netops with Emacs and Org mode

Org mode is a package for Emacs to "keep notes, maintain todo lists, plan projects and author documents". It can execute embedded snippets of code and capture the output (through Babel). It's an invaluable tool for documenting your infrastructure and your operations. Here are three (relatively) short videos exhibiting Org mode use in the context of network operations. In all of them, I am using my own junos-mode. Since some Junos devices can be quite slow, commits and remote executions are done asynchronously¹ with the help of a Python helper.

In the first video, I take some notes about configuring the BGP add-path feature (RFC 7911). It demonstrates all the available features of junos-mode. In the second video, I execute a planned operation to enable this feature in production. The document is a modus operandi and contains the configuration to apply and the commands to check if it works as expected. At the end, the document becomes a detailed report of the operation. In the third video, a cookbook has been prepared to execute some changes. I set some variables and execute the cookbook to apply the change and check the result.

  1. This is a bit of a hack since Babel doesn't have native support for that. Also have a look at ob-async, which is a language-independent implementation of the same idea.

9 February 2017

Vincent Bernat: Integration of a Go service with systemd

Unlike other programming languages, Go's runtime doesn't provide a way to reliably daemonize a service. A system daemon has to supply this functionality. Most distributions ship systemd, which would fit the bill. A correct integration with systemd is quite straightforward. There are two interesting aspects: readiness and liveness. As an example, we will daemonize this service whose goal is to answer requests with nifty 404 errors:
package main

import (
    "log"
    "net"
    "net/http"
)

func main() {
    l, err := net.Listen("tcp", ":8081")
    if err != nil {
        log.Panicf("cannot listen: %s", err)
    }
    http.Serve(l, nil)
}
You can build it with go build 404.go. Here is the service file, 404.service¹:
[Unit]
Description=404 micro-service
[Service]
Type=notify
ExecStart=/usr/bin/404
WatchdogSec=30s
Restart=on-failure
[Install]
WantedBy=multi-user.target

Readiness The classic way for a Unix daemon to signal its readiness is to daemonize. Technically, this is done by calling fork(2) twice (which also serves other intents). This is a very common task and the BSD systems, as well as some other C libraries, supply a daemon(3) function for this purpose. Services are expected to daemonize only when they are ready (after reading configuration files and setting up a listening socket, for example). Then, a system can reliably initialize its services with a simple linear script:
syslogd
unbound
ntpd -s
Each daemon can rely on the previous one being ready to do its work. The sequence of actions is the following:
  1. syslogd reads its configuration, activates /dev/log, daemonizes.
  2. unbound reads its configuration, listens on 127.0.0.1:53, daemonizes.
  3. ntpd reads its configuration, connects to NTP peers, waits for clock to be synchronized2, daemonizes.
With systemd, we would use Type=fork in the service file. However, Go's runtime does not support that. Instead, we use Type=notify. In this case, systemd expects the daemon to signal its readiness with a message to a Unix socket. The go-systemd package handles the details for us:
package main

import (
    "log"
    "net"
    "net/http"

    "github.com/coreos/go-systemd/daemon"
)

func main() {
    l, err := net.Listen("tcp", ":8081")
    if err != nil {
        log.Panicf("cannot listen: %s", err)
    }
    daemon.SdNotify(false, "READY=1") // ❶
    http.Serve(l, nil)                // ❷
}
It's important to place the notification after net.Listen() (in ❶): if the notification was sent earlier, a client would get "connection refused" when trying to use the service. When a daemon listens to a socket, connections are queued by the kernel until the daemon is able to accept them (in ❷). If the service is not run through systemd, the added line is a no-op.
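To see readiness in action, you can install the unit and start the service by hand. A quick sketch (the exact paths are assumptions; see the footnote about the unit directory):
$ go build 404.go
$ cp 404 /usr/bin/404
$ cp 404.service /lib/systemd/system/404.service
$ systemctl daemon-reload
$ systemctl start 404
Since the service is of Type=notify, systemctl start only returns once READY=1 has been received; if the daemon never sends it, the start eventually fails with a timeout.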

Liveness Another interesting feature of systemd is to watch the service and restart it if it happens to crash (thanks to the Restart=on-failure directive). It's also possible to use a watchdog: the service sends watchdog keep-alives at regular intervals. If it fails to do so, systemd will restart it. We could insert the following code just before the http.Serve() call:
// "time" needs to be added to the imports.
go func() {
    interval, err := daemon.SdWatchdogEnabled(false)
    if err != nil || interval == 0 {
        return
    }
    for {
        daemon.SdNotify(false, "WATCHDOG=1")
        time.Sleep(interval / 3)
    }
}()
However, this doesn't add much value: the goroutine is unrelated to the core business of the service. If, for some reason, the HTTP part gets stuck, the goroutine will happily continue to send keep-alives to systemd. In our example, we can just do an HTTP query before sending the keep-alive. The internal loop can be replaced with this code:
for {
    _, err := http.Get("http://127.0.0.1:8081") // ❶
    if err == nil {
        daemon.SdNotify(false, "WATCHDOG=1")
    }
    time.Sleep(interval / 3)
}
In ❶, we connect to the service to check if it's still working. If we get some kind of answer, we send a watchdog keep-alive. If the service is unavailable or if http.Get() gets stuck, systemd will trigger a restart. There is no universal recipe. However, checks can be split into two groups:
  • Before sending a keep-alive, you execute an active check on the components of your service. The keep-alive is sent only if all checks are successful. The checks can be internal (like in the above example) or external (for example, a query to the database).
  • Each component reports its status, telling if it's alive or not. Before sending a keep-alive, you check the reported status of all components (passive check). If some components are late or reported fatal errors, don't send the keep-alive.
If possible, recovery from errors (for example, with a backoff retry) and self-healing (for example, by reestablishing a network connection) is always better, but the watchdog is a good tool to handle the worst cases and avoid overly complex recovery logic. For example, if a component doesn't know how to recover from an exceptional condition3, instead of using panic(), it could signal its situation before dying. Another dedicated component could try to resolve the situation by restarting the faulty component. If it fails to reach a healthy state in time, the watchdog timer will trigger and the whole service will be restarted.
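To make the passive variant more concrete, here is a minimal sketch; the health type and its methods are my own invention, not part of the original service. Components call Report() as they work; the watchdog loop only notifies systemd when every component reported recently:
package main

import (
    "sync"
    "time"

    "github.com/coreos/go-systemd/daemon"
)

// health records the last report time of each component (hypothetical helper).
type health struct {
    mu   sync.Mutex
    seen map[string]time.Time
}

// Report is called by a component whenever it completes a unit of work.
func (h *health) Report(component string) {
    h.mu.Lock()
    defer h.mu.Unlock()
    h.seen[component] = time.Now()
}

// Healthy tells if every known component reported recently enough.
func (h *health) Healthy(maxAge time.Duration) bool {
    h.mu.Lock()
    defer h.mu.Unlock()
    for _, last := range h.seen {
        if time.Since(last) > maxAge {
            return false
        }
    }
    return true
}

func watchdog(h *health) {
    interval, err := daemon.SdWatchdogEnabled(false)
    if err != nil || interval == 0 {
        return
    }
    for {
        // Send the keep-alive only when all components are alive.
        if h.Healthy(interval) {
            daemon.SdNotify(false, "WATCHDOG=1")
        }
        time.Sleep(interval / 3)
    }
}

func main() {
    h := &health{seen: map[string]time.Time{}}
    go watchdog(h)
    for { // a stand-in for the real components
        h.Report("worker")
        time.Sleep(time.Second)
    }
}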

  1. Depending on the distribution, this should be installed in /lib/systemd/system or /usr/lib/systemd/system. Check with the output of the command pkg-config systemd --variable=systemdsystemunitdir.
  2. This highly depends on the NTP daemon used. OpenNTPD doesn't wait unless you use the -s option. ISC NTP doesn't either unless you use the --wait-sync option.
  3. An example of an exceptional condition is reaching the limit on the number of file descriptors. Self-healing from this situation is difficult and it's easy to get stuck in a loop.

8 February 2017

Steve Kemp: Old packages are interesting.

Recently Vincent Bernat wrote about writing his own simple terminal, using vte. That was a fun read, as the sample code built really easily and was functional. At the end of his post he said:
evilvte is quite customizable and can be lightweight. Consider it as a first alternative. Honestly, I don't remember why I didn't pick it.
That set me off looking at evilvte, and it was one of those rare projects which seems to be pretty stable, and also hasn't changed in any recent release of Debian GNU/Linux: I wonder if it would be possible to easily generate a list of packages which have the same revision in multiple distributions? Anyway I had a look at the source, and unfortunately spotted that it didn't entirely handle clicking on hyperlinks terribly well. Clicking on a link would pretty much run:
 firefox '%s'
That meant there was an obvious security problem. It is a great terminal though, and it just goes to show how short, simple, and readable such things can be. I enjoyed looking at the source, and furthermore enjoyed using it. Unfortunately due to a dependency issue it looks like this package will be removed from stretch.

7 February 2017

Vincent Bernat: Write your own terminal emulator

I was a happy user of rxvt-unicode until I got a laptop with a HiDPI display. Switching from a LoDPI to a HiDPI screen and back was a pain: I had to manually adjust the font size on all terminals or restart them. VTE is a library to build a terminal emulator using the GTK+ toolkit, which handles DPI changes. It is used by many terminal emulators, like GNOME Terminal, evilvte, sakura, termit and ROXTerm. The library is quite straightforward and writing a terminal doesn't take much time if you don't need many features. Let's see how to write a simple one.

A simple terminal Let's start small with a terminal with the default settings. We'll write that in C. Another supported option is Vala.
#include <vte/vte.h>

int
main(int argc, char *argv[])
{
    GtkWidget *window, *terminal;

    /* Initialise GTK, the window and the terminal */
    gtk_init(&argc, &argv);
    terminal = vte_terminal_new();
    window = gtk_window_new(GTK_WINDOW_TOPLEVEL);
    gtk_window_set_title(GTK_WINDOW(window), "myterm");

    /* Start a new shell */
    gchar **envp = g_get_environ();
    gchar **command = (gchar *[]){ g_strdup(g_environ_getenv(envp, "SHELL")), NULL };
    g_strfreev(envp);
    vte_terminal_spawn_sync(VTE_TERMINAL(terminal),
        VTE_PTY_DEFAULT,
        NULL,       /* working directory  */
        command,    /* command */
        NULL,       /* environment */
        0,          /* spawn flags */
        NULL, NULL, /* child setup */
        NULL,       /* child pid */
        NULL, NULL);

    /* Connect some signals */
    g_signal_connect(window, "delete-event", gtk_main_quit, NULL);
    g_signal_connect(terminal, "child-exited", gtk_main_quit, NULL);

    /* Put widgets together and run the main loop */
    gtk_container_add(GTK_CONTAINER(window), terminal);
    gtk_widget_show_all(window);
    gtk_main();
}
You can compile it with the following command:
gcc -O2 -Wall $(pkg-config --cflags --libs vte-2.91) term.c -o term
And run it with ./term: Simple VTE-based terminal

More features From here, you can have a look at the documentation to alter behavior or add more features. Here are three examples.

Colors You can define the 16 basic colors with the following code:
#define CLR_R(x)   (((x) & 0xff0000) >> 16)
#define CLR_G(x)   (((x) & 0x00ff00) >>  8)
#define CLR_B(x)   (((x) & 0x0000ff) >>  0)
#define CLR_16(x)  ((double)(x) / 0xff)
#define CLR_GDK(x) (const GdkRGBA){ .red = CLR_16(CLR_R(x)), \
                                    .green = CLR_16(CLR_G(x)), \
                                    .blue = CLR_16(CLR_B(x)), \
                                    .alpha = 0 }
vte_terminal_set_colors(VTE_TERMINAL(terminal),
    &CLR_GDK(0xffffff),
    &(GdkRGBA){ .alpha = 0.85 },
    (const GdkRGBA[]){
        CLR_GDK(0x111111),
        CLR_GDK(0xd36265),
        CLR_GDK(0xaece91),
        CLR_GDK(0xe7e18c),
        CLR_GDK(0x5297cf),
        CLR_GDK(0x963c59),
        CLR_GDK(0x5E7175),
        CLR_GDK(0xbebebe),
        CLR_GDK(0x666666),
        CLR_GDK(0xef8171),
        CLR_GDK(0xcfefb3),
        CLR_GDK(0xfff796),
        CLR_GDK(0x74b8ef),
        CLR_GDK(0xb85e7b),
        CLR_GDK(0xA3BABF),
        CLR_GDK(0xffffff)
    }, 16);
While you can't see it on the screenshot1, this also enables background transparency. Color rendering

Miscellaneous settings VTE comes with many settings to change the behavior of the terminal. Consider the following code:
vte_terminal_set_scrollback_lines(VTE_TERMINAL(terminal), 0);
vte_terminal_set_scroll_on_output(VTE_TERMINAL(terminal), FALSE);
vte_terminal_set_scroll_on_keystroke(VTE_TERMINAL(terminal), TRUE);
vte_terminal_set_rewrap_on_resize(VTE_TERMINAL(terminal), TRUE);
vte_terminal_set_mouse_autohide(VTE_TERMINAL(terminal), TRUE);
This will:
  • disable the scrollback buffer,
  • not scroll to the bottom on new output,
  • scroll to the bottom on keystroke,
  • rewrap content when the terminal size changes, and
  • hide the mouse cursor when typing.
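Another setting you may want, not covered above, is the font. A possible sketch using the VTE and Pango APIs (the font string is an arbitrary choice):
/* To be placed after the terminal is created; "Monospace 11" is arbitrary. */
PangoFontDescription *font = pango_font_description_from_string("Monospace 11");
vte_terminal_set_font(VTE_TERMINAL(terminal), font);
pango_font_description_free(font);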

Update the window title An application can change the window title using XTerm control sequences (for example, with printf "\e]2;${title}\a"). If you want the actual window title to reflect this, you need to define this function:
static gboolean
on_title_changed(GtkWidget *terminal, gpointer user_data)
{
    GtkWindow *window = user_data;
    gtk_window_set_title(window,
        vte_terminal_get_window_title(VTE_TERMINAL(terminal))?:"Terminal");
    return TRUE;
}
Then, connect it to the appropriate signal, in main():
g_signal_connect(terminal, "window-title-changed", 
    G_CALLBACK(on_title_changed), GTK_WINDOW(window));

Final words I don't need much more as I am using tmux inside each terminal. In my own copy, I have also added the ability to complete a word using ones from the current window or other windows (also known as dynamic abbrev expansion). This requires implementing a terminal daemon to handle all terminal windows with one process, similar to urxvtcd. While writing a terminal from scratch2 suits my need, it may not be worth it. evilvte is quite customizable and can be lightweight. Consider it as a first alternative. Honestly, I don't remember why I didn't pick it. UPDATED: evilvte has not seen an update since 2014. Its GTK+3 support is buggy. It doesn't support the latest versions of the VTE library. Therefore, it's not a good idea to use it. You should also note that the primary goal of VTE is to be a library to support GNOME Terminal. Notably, if a feature is not needed for GNOME Terminal, it won't be added to VTE. If it already exists, it will likely be deprecated and removed.

  1. Transparency is handled by the composite manager (Compton, in my case).
  2. For some definition of scratch since the hard work is handled by VTE.

5 February 2017

Vincent Bernat: A Makefile for your Go project

My most loathed feature of Go is the mandatory use of GOPATH: I do not want to put my own code next to its dependencies. Hopefully, this issue is slowly starting to be accepted by the main authors. In the meantime, you can work around this problem with more opinionated tools (like gb) or by crafting your own Makefile. For the latter, you can have a look at Filippo Valsorda's example or my own take which I describe in more details here. This is not meant to be a universal Makefile but a relatively short one with some batteries included. It comes with a simple "Hello World!" application.

Project structure For a standalone project, vendoring is a must-have1 as you cannot rely on your dependencies to not introduce backward-incompatible changes. Some packages are using versioned URLs but most of them aren't. There is currently no standard tool to handle vendoring. My personal take is to vendor all dependencies with Glide. It is a good practice to split an application into different packages while the main one stays fairly small. In the hellogopher example, the CLI is handled in the cmd package while the application logic for printing greetings is in the hello package:
.
  cmd/
    hello.go
    root.go
    version.go
  glide.lock (generated)
  glide.yaml
  vendor/ (dependencies will go there)
  hello/
    root.go
    root_test.go
  main.go
  Makefile
  README.md

Down the rabbit hole Let's take a look at the various features of the Makefile.

GOPATH handling Since all dependencies are vendored, only our own project needs to be in the GOPATH:
PACKAGE  = hellogopher
GOPATH   = $(CURDIR)/.gopath
BASE     = $(GOPATH)/src/$(PACKAGE)
$(BASE):
    @mkdir -p $(dir $@)
    @ln -sf $(CURDIR) $@
The base import path is hellogopher, not github.com/vincentbernat/hellogopher: this shortens imports and makes them easily distinguishable from imports of dependency packages. However, your application won't be go get-able. This is a personal choice and can be adjusted with the $(PACKAGE) variable. We just create a symlink from .gopath/src/hellogopher to our root directory. The GOPATH environment variable is automatically exported to the shell commands of the recipes. Any tool should work fine after changing the current directory to $(BASE). For example, this snippet builds the executable:
.PHONY: all
all: | $(BASE)
    cd $(BASE) && $(GO) build -o bin/$(PACKAGE) main.go

Vendoring dependencies Glide is a bit like Ruby's Bundler. In glide.yaml, you specify what packages you need and the constraints you want on them. Glide computes a glide.lock file containing the exact versions for each dependency (including recursive dependencies) and downloads them in the vendor/ folder. I choose to check into the VCS both glide.yaml and glide.lock files. It's also possible to only check in the first one or to also check in the vendor/ directory. A work-in-progress is currently ongoing to provide a standard dependency management tool with a similar workflow. We define two rules2:
GLIDE = glide
glide.lock: glide.yaml | $(BASE)
    cd $(BASE) && $(GLIDE) update
    @touch $@
vendor: glide.lock | $(BASE)
    cd $(BASE) && $(GLIDE) --quiet install
    @ln -sf . vendor/src
    @touch $@
We use a variable to invoke glide. This enables a user to easily override it (for example, with make GLIDE=$GOPATH/bin/glide).

Using third-party tools Most projects need some third-party tools. We can either expect them to be already installed or compile them in our private GOPATH. For example, here is the lint rule:
BIN    = $(GOPATH)/bin
GOLINT = $(BIN)/golint
$(BIN)/golint: | $(BASE) # ❶
    go get github.com/golang/lint/golint
.PHONY: lint
lint: vendor | $(BASE) $(GOLINT) # ❷
    @cd $(BASE) && ret=0 && for pkg in $(PKGS); do \
        test -z "$$($(GOLINT) $$pkg | tee /dev/stderr)" || ret=1 ; \
     done ; exit $$ret
As with glide, we give the user a chance to override which golint executable to use. By default, it uses a private copy. But a user can use their own copy with make GOLINT=/usr/bin/golint. In ❶, we have the recipe to build the private copy. We simply issue go get3 to download and build golint. In ❷, the lint rule executes golint on each package contained in the $(PKGS) variable. We'll explain this variable in the next section.

Working with non-vendored packages only Some commands need to be provided with a list of packages. Because we use a vendor/ directory, the shortcut ./... is not what we expect as we don't want to run tests on our dependencies4. Therefore, we compose a list of packages we care about:
PKGS = $(or $(PKG), $(shell cd $(BASE) && \
    env GOPATH=$(GOPATH) $(GO) list ./... | grep -v "^$(PACKAGE)/vendor/"))
If the user has provided the $(PKG) variable, we use it. For example, if they want to lint only the cmd package, they can invoke make lint PKG=hellogopher/cmd which is more intuitive than specifying PKGS. Otherwise, we just execute go list ./... but we remove anything from the vendor directory.

Tests Here are some rules to run tests:
TIMEOUT = 20
TEST_TARGETS := test-default test-bench test-short test-verbose test-race
.PHONY: $(TEST_TARGETS) check test tests
test-bench:   ARGS=-run=__absolutelynothing__ -bench=.
test-short:   ARGS=-short
test-verbose: ARGS=-v
test-race:    ARGS=-race
$(TEST_TARGETS): test
check test tests: fmt lint vendor | $(BASE)
    @cd $(BASE) && $(GO) test -timeout $(TIMEOUT)s $(ARGS) $(PKGS)
A user can invoke tests in different ways:
  • make test runs all tests;
  • make test TIMEOUT=10 runs all tests with a timeout of 10 seconds;
  • make test PKG=hellogopher/cmd only runs tests for the cmd package;
  • make test ARGS="-v -short" runs tests with the specified arguments;
  • make test-race runs tests with race detector enabled.
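The check and test rules above also depend on a fmt rule which is not shown in this post; a minimal version could look like this (a sketch, the actual Makefile may differ):
.PHONY: fmt
fmt: | $(BASE)
    @cd $(BASE) && $(GO) fmt $(PKGS)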

Tests coverage go test includes a test coverage tool. Unfortunately, it only handles one package at a time and you have to explicitly list the packages to be instrumented, otherwise the instrumentation is limited to the currently tested package. If you provide too many packages, the compilation time will skyrocket. Moreover, if you want an output compatible with Jenkins, you'll need some additional tools.
COVERAGE_MODE    = atomic
COVERAGE_PROFILE = $(COVERAGE_DIR)/profile.out
COVERAGE_XML     = $(COVERAGE_DIR)/coverage.xml
COVERAGE_HTML    = $(COVERAGE_DIR)/index.html
.PHONY: test-coverage test-coverage-tools
test-coverage-tools: | $(GOCOVMERGE) $(GOCOV) $(GOCOVXML) # ❶
test-coverage: COVERAGE_DIR := $(CURDIR)/test/coverage.$(shell date -Iseconds)
test-coverage: fmt lint vendor test-coverage-tools | $(BASE)
    @mkdir -p $(COVERAGE_DIR)/coverage
    @cd $(BASE) && for pkg in $(PKGS); do \ # ❷
        $(GO) test \
            -coverpkg=$$($(GO) list -f '{{ join .Deps "\n" }}' $$pkg | \
                    grep '^$(PACKAGE)/' | grep -v '^$(PACKAGE)/vendor/' | \
                    tr '\n' ',')$$pkg \
            -covermode=$(COVERAGE_MODE) \
            -coverprofile="$(COVERAGE_DIR)/coverage/`echo $$pkg | tr "/" "-"`.cover" $$pkg ;\
     done
    @$(GOCOVMERGE) $(COVERAGE_DIR)/coverage/*.cover > $(COVERAGE_PROFILE)
    @$(GO) tool cover -html=$(COVERAGE_PROFILE) -o $(COVERAGE_HTML)
    @$(GOCOV) convert $(COVERAGE_PROFILE) | $(GOCOVXML) > $(COVERAGE_XML)
First, we define some variables to let the user override them. We also require the following tools (in ❶):
  • gocovmerge merges profiles from different runs into a single one;
  • gocov-xml converts a coverage profile to the Cobertura format;
  • gocov is needed to convert a coverage profile to a format handled by gocov-xml.
The rules to build those tools are similar to the rule for golint described a few sections ago. In ❷, for each package to test, we run go test with the -coverprofile argument. We also explicitly provide the list of packages to instrument to -coverpkg by using go list to get the list of dependencies for the tested package and keeping only our own.
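As an illustration, the rule for gocovmerge could mirror the golint one; a sketch, assuming the upstream import path github.com/wadey/gocovmerge:
GOCOVMERGE = $(BIN)/gocovmerge
$(BIN)/gocovmerge: | $(BASE)
    go get github.com/wadey/gocovmerge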

Final result While the main goal of using a Makefile was to work around GOPATH, it's also a good place to hide the complexity of some operations, notably around test coverage. The excerpts provided in this post are a bit simplified. Have a look at the final result for more perks!

  1. In Go, vendoring is about both bundling and dependency management. As the Go ecosystem matures, the bundling part (fixed snapshots of dependencies) may become optional but the vendor/ directory may stay for dependency management (retrieval of the latest versions of dependencies matching a set of constraints).
  2. If you don't want to automatically update glide.lock when a change is detected in glide.yaml, rename the target to deps-update and make it a phony target.
  3. There is some irony in bad-mouthing go get and then immediately using it because it is convenient.
  4. I think ./... should not include the vendor/ directory by default. Dependencies should be trusted to have run their own tests in the environment they expect them to succeed. Unfortunately, this is unlikely to change.

3 December 2016

Vincent Bernat: Build-time dependency patching for Android

This post shows how to patch an external dependency for an Android project at build-time with Gradle. This leverages the Transform API and Javassist, a Java bytecode manipulation tool.
buildscript {
    dependencies {
        classpath 'com.android.tools.build:gradle:2.2.+'
        classpath 'com.android.tools.build:transform-api:1.5.+'
        classpath 'org.javassist:javassist:3.21.+'
        classpath 'commons-io:commons-io:2.4'
    }
}
Disclaimer: I am not a seasoned Android programmer, so take this with a grain of salt.

Context This section adds some context to the example. Feel free to skip it. Dashkiosk is an application to manage dashboards on many displays. It provides an Android application you can install on one of those cheap Android sticks. Under the hood, the application is an embedded webview backed by the Crosswalk Project web runtime which brings an up-to-date web engine, even for older versions of Android1. Recently, a security vulnerability has been spotted in how invalid certificates were handled. When a certificate cannot be verified, the webview defers the decision to the host application by calling the onReceivedSslError() method:
Notify the host application that an SSL error occurred while loading a resource. The host application must call either callback.onReceiveValue(true) or callback.onReceiveValue(false). Note that the decision may be retained for use in response to future SSL errors. The default behavior is to pop up a dialog.
The default behavior is specific to the Crosswalk webview: the Android built-in one just cancels the load. Unfortunately, the fix applied by Crosswalk is different and, as a side effect, the onReceivedSslError() method is not invoked anymore2. Dashkiosk comes with an option to ignore TLS errors3. The mentioned security fix breaks this feature. The following example will demonstrate how to patch Crosswalk to recover the previous behavior4.

Simple method replacement Let s replace the shouldDenyRequest() method from the org.xwalk.core.internal.SslUtil class with this version:
// In SslUtil class
public static boolean shouldDenyRequest(int error) {
    return false;
}

Transform registration Gradle Transform API enables the manipulation of compiled class files before they are converted to DEX files. To declare a transform and register it, include the following code in your build.gradle:
import com.android.build.api.transform.Context
import com.android.build.api.transform.QualifiedContent
import com.android.build.api.transform.Transform
import com.android.build.api.transform.TransformException
import com.android.build.api.transform.TransformInput
import com.android.build.api.transform.TransformOutputProvider
import org.gradle.api.logging.Logger
class PatchXWalkTransform extends Transform {
    Logger logger = null;

    public PatchXWalkTransform(Logger logger) {
        this.logger = logger
    }

    @Override
    String getName() {
        return "PatchXWalk"
    }

    @Override
    Set<QualifiedContent.ContentType> getInputTypes() {
        return Collections.singleton(QualifiedContent.DefaultContentType.CLASSES)
    }

    @Override
    Set<QualifiedContent.Scope> getScopes() {
        return Collections.singleton(QualifiedContent.Scope.EXTERNAL_LIBRARIES)
    }

    @Override
    boolean isIncremental() {
        return true
    }

    @Override
    void transform(Context context,
                   Collection<TransformInput> inputs,
                   Collection<TransformInput> referencedInputs,
                   TransformOutputProvider outputProvider,
                   boolean isIncremental) throws IOException, TransformException, InterruptedException {
        // We should do something here
    }
}
// Register the transform
android.registerTransform(new PatchXWalkTransform(logger))
The getInputTypes() method should return the set of types of data consumed by the transform. In our case, we want to transform classes. Another possibility is to transform resources. The getScopes() method should return a set of scopes for the transform. In our case, we are only interested in the external libraries. It's also possible to transform our own classes. The isIncremental() method returns true because we support incremental builds. The transform() method is expected to take all the provided inputs and copy them (with or without modifications) to the location supplied by the output provider. We didn't implement this method yet. This causes the removal of all external dependencies from the application.

Noop transform To keep all external dependencies unmodified, we must copy them:
@Override
void transform(Context context,
               Collection<TransformInput> inputs,
               Collection<TransformInput> referencedInputs,
               TransformOutputProvider outputProvider,
               boolean isIncremental) throws IOException, TransformException, InterruptedException {
    inputs.each {
        it.jarInputs.each {
            def jarName = it.name
            def src = it.getFile()
            def dest = outputProvider.getContentLocation(jarName,
                                                         it.contentTypes, it.scopes,
                                                         Format.JAR);
            def status = it.getStatus()
            if (status == Status.REMOVED) { // ❶
                logger.info("Remove ${src}")
                FileUtils.delete(dest)
            } else if (!isIncremental || status != Status.NOTCHANGED) { // ❷
                logger.info("Copy ${src}")
                FileUtils.copyFile(src, dest)
            }
        }
    }
}
We also need two additional imports:
import com.android.build.api.transform.Status
import org.apache.commons.io.FileUtils
Since we are handling external dependencies, we only have to manage JAR files. Therefore, we only iterate on jarInputs and not on directoryInputs. There are two cases when handling an incremental build: either the file has been removed (❶) or it has been modified (❷). In all other cases, we can safely assume the file is already correctly copied.

JAR patching When the external dependency is the Crosswalk JAR file, we also need to modify it. Here is the first part of the code (replacing ):
if ("$ src " ==~ ".*/org.xwalk/xwalk_core.*/classes.jar")  
    def pool = new ClassPool()
    pool.insertClassPath("$ src ")
    def ctc = pool.get('org.xwalk.core.internal.SslUtil') //  
    def ctm = ctc.getDeclaredMethod('shouldDenyRequest')
    ctc.removeMethod(ctm) //  
    ctc.addMethod(CtNewMethod.make("""
public static boolean shouldDenyRequest(int error)  
    return false;
 
""", ctc)) //  
    def sslUtilBytecode = ctc.toBytecode() //  
    // Write back the JAR file
    //  
  else  
    logger.info("Copy $ src ")
    FileUtils.copyFile(src, dest)
 
We also need the following additional imports to use Javassist:
import javassist.ClassPath
import javassist.ClassPool
import javassist.CtNewMethod
Once we have located the JAR file we want to modify, we add it to our classpath and retrieve the class we are interested in (❶). We locate the appropriate method and delete it (❷). Then, we add our custom method using the same name (❸). The whole operation is done in memory. We retrieve the bytecode of the modified class in ❹. The remaining step is to rebuild the JAR file:
def input = new JarFile(src)
def output = new JarOutputStream(new FileOutputStream(dest))
// ❶
input.entries().each {
    if (!it.getName().equals("org/xwalk/core/internal/SslUtil.class")) {
        def s = input.getInputStream(it)
        output.putNextEntry(new JarEntry(it.getName()))
        IOUtils.copy(s, output)
        s.close()
    }
}
// ❷
output.putNextEntry(new JarEntry("org/xwalk/core/internal/SslUtil.class"))
output.write(sslUtilBytecode)
output.close()
We need the following additional imports:
import java.util.jar.JarEntry
import java.util.jar.JarFile
import java.util.jar.JarOutputStream
import org.apache.commons.io.IOUtils
There are two steps. In ❶, all classes are copied to the new JAR, except the SslUtil class. In ❷, the modified bytecode for SslUtil is added to the JAR. That's all! You can view the complete example on GitHub.

More complex method replacement In the above example, the new method doesn't use any external dependency. Let's suppose we also want to replace the sslErrorFromNetErrorCode() method from the same class with the following one:
import org.chromium.net.NetError;
import android.net.http.SslCertificate;
import android.net.http.SslError;

// In SslUtil class
public static SslError sslErrorFromNetErrorCode(int error,
                                                SslCertificate cert,
                                                String url) {
    switch(error) {
        case NetError.ERR_CERT_COMMON_NAME_INVALID:
            return new SslError(SslError.SSL_IDMISMATCH, cert, url);
        case NetError.ERR_CERT_DATE_INVALID:
            return new SslError(SslError.SSL_DATE_INVALID, cert, url);
        case NetError.ERR_CERT_AUTHORITY_INVALID:
            return new SslError(SslError.SSL_UNTRUSTED, cert, url);
        default:
            break;
    }
    return new SslError(SslError.SSL_INVALID, cert, url);
}
The major difference with the previous example is that we need to import some additional classes.

Android SDK import The classes from the Android SDK are not part of the external dependencies. They need to be imported separately. The full path of the JAR file is:
androidJar = "$ android.getSdkDirectory().getAbsolutePath() /platforms/" +
             "$ android.getCompileSdkVersion() /android.jar"
We need to load it before adding the new method into SslUtil class:
def pool = new ClassPool()
pool.insertClassPath(androidJar)
pool.insertClassPath("${src}")
def ctc = pool.get('org.xwalk.core.internal.SslUtil')
def ctm = ctc.getDeclaredMethod('sslErrorFromNetErrorCode')
ctc.removeMethod(ctm)
pool.importPackage('android.net.http.SslCertificate');
pool.importPackage('android.net.http.SslError');
// …

External dependency import We must also import org.chromium.net.NetError and therefore, we need to put the appropriate JAR in our classpath. The easiest way is to iterate through all the external dependencies and add them to the classpath.
def pool = new ClassPool()
pool.insertClassPath(androidJar)
inputs.each {
    it.jarInputs.each {
        def jarName = it.name
        def src = it.getFile()
        def status = it.getStatus()
        if (status != Status.REMOVED) {
            pool.insertClassPath("${src}")
        }
    }
}
def ctc = pool.get('org.xwalk.core.internal.SslUtil')
def ctm = ctc.getDeclaredMethod('sslErrorFromNetErrorCode')
ctc.removeMethod(ctm)
pool.importPackage('android.net.http.SslCertificate');
pool.importPackage('android.net.http.SslError');
pool.importPackage('org.chromium.net.NetError');
ctc.addMethod(CtNewMethod.make("…", ctc))
// Then, rebuild the JAR...
Happy hacking!

  1. Before Android 4.4, the webview was severely outdated. Starting from Android 5, the webview is shipped as a separate component with updates. Embedding Crosswalk is still convenient as you know exactly which version you can rely on.
  2. I hope to have this fixed in later versions.
  3. This may seem harmful and you are right. However, if you have an internal CA, it is currently not possible to provide your own trust store to a webview. Moreover, the system trust store is not used either. You also may want to use TLS for authentication only with client certificates, a feature supported by Dashkiosk.
  4. Crosswalk being an open source project, an alternative would have been to patch the Crosswalk source code and recompile it. However, Crosswalk embeds Chromium and recompiling the whole thing consumes a lot of resources.

2 May 2016

Vincent Bernat: Pragmatic Debian packaging

While the creation of Debian packages is abundantly documented, most tutorials are targeted to packages implementing the Debian policy. Moreover, Debian packaging has a reputation of being unnecessarily difficult1 and many people prefer to use less constrained tools2 like fpm or CheckInstall. However, I would like to show how building Debian packages with the official tools can become straightforward if you bend some rules:
  1. No source package will be generated. Packages will be built directly from a checkout of a VCS repository.
  2. Additional dependencies can be downloaded during build. Packaging each dependency individually is painstaking work, notably when you have to deal with some fast-paced ecosystems like Java, Javascript and Go.
  3. The produced packages may bundle dependencies. This is likely to raise some concerns about security and long-term maintenance, but this is a common trade-off in many ecosystems, notably Java, Javascript and Go.

Pragmatic packages 101 In the Debian archive, you have two kinds of packages: the source packages and the binary packages. Each binary package is built from a source package. You need a name for each package. As stated in the introduction, we won't generate a source package but we will work with its unpacked form, which is any source tree containing a debian/ directory. In our examples, we will start with a source tree containing only a debian/ directory but you are free to include this debian/ directory into an existing project. As an example, we will package memcached, a distributed memory cache. There are four files to create:
  • debian/compat,
  • debian/changelog,
  • debian/control, and
  • debian/rules.
The first one is easy. Just put 9 in it:
echo 9 > debian/compat
The second one has the following content:
memcached (0-0) UNRELEASED; urgency=medium
  * Fake entry
 -- Happy Packager <happy@example.com>  Tue, 19 Apr 2016 22:27:05 +0200
The only important information is the name of the source package, memcached, on the first line. Everything else can be left as is as it won't influence the generated binary packages.

The control file debian/control describes the metadata of both the source package and the generated binary packages. We have to write a block for each of them.
Source: memcached
Maintainer: Vincent Bernat <bernat@debian.org>

Package: memcached
Architecture: any
Description: high-performance memory object caching system
The source package is called memcached. We have to use the same name as in debian/changelog. We generate only one binary package: memcached. In the remainder of the example, when you see memcached, this is the name of a binary package. The Architecture field should be set to either any or all. Use all exclusively if the package contains only arch-independent files. When in doubt, just stick to any. The Description field contains a short description of the binary package.

The build recipe The last mandatory file is debian/rules. It's the recipe of the package. We need to retrieve memcached, build it and install its file tree in debian/memcached/. It looks like this:
#!/usr/bin/make -f
DISTRIBUTION = $(shell lsb_release -sr)
VERSION = 1.4.25
PACKAGEVERSION = $(VERSION)-0~$(DISTRIBUTION)0
TARBALL = memcached-$(VERSION).tar.gz
URL = http://www.memcached.org/files/$(TARBALL)
%:
    dh $@
override_dh_auto_clean:
override_dh_auto_test:
override_dh_auto_build:
override_dh_auto_install:
    wget -N --progress=dot:mega $(URL)
    tar --strip-components=1 -xf $(TARBALL)
    ./configure --prefix=/usr
    make
    make install DESTDIR=debian/memcached
override_dh_gencontrol:
    dh_gencontrol -- -v$(PACKAGEVERSION)
The empty targets override_dh_auto_clean, override_dh_auto_test and override_dh_auto_build keep debhelper from being too smart. The override_dh_gencontrol target sets the package version3 without updating debian/changelog. If you ignore the slight boilerplate, the recipe is quite similar to what you would have done with fpm:
DISTRIBUTION=$(lsb_release -sr)
VERSION=1.4.25
PACKAGEVERSION=${VERSION}-0~${DISTRIBUTION}0
TARBALL=memcached-${VERSION}.tar.gz
URL=http://www.memcached.org/files/${TARBALL}
wget -N --progress=dot:mega ${URL}
tar --strip-components=1 -xf ${TARBALL}
./configure --prefix=/usr
make
make install DESTDIR=/tmp/installdir
# Build the final package
fpm -s dir -t deb \
    -n memcached \
    -v ${PACKAGEVERSION} \
    -C /tmp/installdir \
    --description "high-performance memory object caching system"
You can review the whole package tree on GitHub and build it with dpkg-buildpackage -us -uc -b.

Pragmatic packages 102 At this point, we can iterate and add several improvements to our memcached package. None of those are mandatory but they are usually worth the additional effort.

Build dependencies Our initial build recipe only works when several packages are installed, like wget and libevent-dev. They are not present on all Debian systems. You can easily express that you need them by adding a Build-Depends section for the source package in debian/control:
Source: memcached
Build-Depends: debhelper (>= 9),
               wget, ca-certificates, lsb-release,
               libevent-dev
Always specify the debhelper (>= 9) dependency as we heavily rely on it. We don't require make or a C compiler because it is assumed that the build-essential meta-package is installed and pulls them in. dpkg-buildpackage will complain if the dependencies are not met. If you want to install those packages from your CI system, you can use the following command4:
mk-build-deps \
    -t 'apt-get -o Debug::pkgProblemResolver=yes --no-install-recommends -qqy' \
    -i -r debian/control
You may also want to investigate pbuilder or sbuild, two tools to build Debian packages in a clean isolated environment.

Runtime dependencies If the resulting package is installed on a freshly installed machine, it won't work because it will be missing libevent, a required library for memcached. You can express the dependencies needed by each binary package by adding a Depends field. Moreover, for dynamic libraries, you can automatically get the right dependencies by using some substitution variables:
Package: memcached
Depends: ${misc:Depends}, ${shlibs:Depends}
The resulting package will contain the following information:
$ dpkg -I ../memcached_1.4.25-0\~unstable0_amd64.deb | grep Depends
 Depends: libc6 (>= 2.17), libevent-2.0-5 (>= 2.0.10-stable)

Integration with init system Most packaged daemons come with some integration with the init system. This integration ensures the daemon will be started on boot and restarted on upgrade. For Debian-based distributions, there are several init systems available. The most prominent ones are:
  • System-V init is the historical init system. More modern inits are able to reuse scripts written for this init, so this is a safe common denominator for packaged daemons.
  • Upstart is the less-historical init system for Ubuntu (used in Ubuntu 14.10 and previous releases).
  • systemd is the default init system for Debian since Jessie and for Ubuntu since 15.04.
Writing a correct script for the System-V init is error-prone. Therefore, I usually prefer to provide a native configuration file for the default init system of the targeted distribution (Upstart and systemd).

System-V If you want to provide a System-V init script, have a look at /etc/init.d/skeleton on the most ancient distribution you want to target and adapt it5. Put the result in debian/memcached.init. It will be installed at the right place, invoked on install, upgrade and removal. On Debian-based systems, many init scripts allow user customizations by providing a /etc/default/memcached file. You can ship one by putting its content in debian/memcached.default.

Upstart Providing an Upstart job is similar: put it in debian/memcached.upstart. For example:
description "memcached daemon"
start on runlevel [2345]
stop on runlevel [!2345]
respawn
respawn limit 5 60
expect daemon
script
  . /etc/default/memcached
  exec memcached -d -u $USER -p $PORT -m $CACHESIZE -c $MAXCONN $OPTIONS
end script
When writing an Upstart job, the most important directive is expect. Be sure to get it right. Here, we use expect daemon and memcached is started with the -d flag.

systemd Providing a systemd unit is a bit more complex. The content of the file should go in debian/memcached.service. For example:
[Unit]
Description=memcached daemon
After=network.target
[Service]
Type=forking
EnvironmentFile=/etc/default/memcached
ExecStart=/usr/bin/memcached -d -u $USER -p $PORT -m $CACHESIZE -c $MAXCONN $OPTIONS
Restart=on-failure
[Install]
WantedBy=multi-user.target
We reuse /etc/default/memcached even if it is not considered a good practice with systemd6. Like for Upstart, the directive Type is quite important. We use forking as memcached is started with the -d flag. You also need to add a build-dependency on dh-systemd in debian/control:
Source: memcached
Build-Depends: debhelper (>= 9),
               wget, ca-certificates, lsb-release,
               libevent-dev,
               dh-systemd
And you need to modify the default rule in debian/rules:
%:
    dh $@ --with systemd
The extra complexity is a bit unfortunate but systemd integration is not part of debhelper7. Without those additional modifications, the unit will get installed but you won't get proper integration and the service won't be enabled on install or boot.
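After installing the resulting package on a systemd host, you can check the integration with a couple of plain systemctl commands (nothing package-specific here):
$ systemctl is-enabled memcached
$ systemctl status memcached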

Dedicated user Many daemons don't need to run as root and it is a good practice to ship a dedicated user. In the case of memcached, we can provide a _memcached user8. Add a debian/memcached.postinst file with the following content:
#!/bin/sh
set -e
case "$1" in
    configure)
        adduser --system --disabled-password --disabled-login --home /var/empty \
                --no-create-home --quiet --force-badname --group _memcached
        ;;
esac
#DEBHELPER#
exit 0
There is no cleanup of the user when the package is removed for two reasons:
  1. Less stuff to write.
  2. The user could still own some files.
The utility adduser will do the right thing whether the requested user already exists or not. You need to add it as a dependency in debian/control:
Package: memcached
Depends: ${misc:Depends}, ${shlibs:Depends}, adduser
The #DEBHELPER# marker is important as it will be replaced by some code to handle the service configuration files (or some other stuff). You can review the whole package tree on GitHub and build it with dpkg-buildpackage -us -uc -b.

Pragmatic packages 103 It is possible to leverage debhelper to reduce the recipe size and to make it more declarative. This section is quite optional and it requires understanding a bit more how a Debian package is built. Feel free to skip it.

The big picture There are four steps to build a regular Debian package:
  1. debian/rules clean should clean the source tree to make it pristine.
  2. debian/rules build should trigger the build. For an autoconf-based software, like memcached, this step should execute something like ./configure && make.
  3. debian/rules install should install the file tree of each binary package. For an autoconf-based software, this step should execute make install DESTDIR=debian/memcached.
  4. debian/rules binary will pack the different file trees into binary packages.
You don't directly write each of those targets. Instead, you let dh, a component of debhelper, do most of the work. The following debian/rules file should do almost everything correctly with many source packages:
#!/usr/bin/make -f
%:
    dh $@
For each of the four targets described above, you can run dh with --no-act to see what it would do. For example:
$ dh build --no-act
   dh_testdir
   dh_update_autotools_config
   dh_auto_configure
   dh_auto_build
   dh_auto_test
Each of those helpers has a manual page. Helpers starting with dh_auto_ are a bit "magic". For example, dh_auto_configure will try to automatically configure a package prior to building: it will detect the build system and invoke ./configure, cmake or Makefile.PL. If one of the helpers does not do the right thing, you can replace it by using an override target:
override_dh_auto_configure:
    ./configure --with-some-grog
Those helpers are also configurable, so you can alter their behaviour a bit by invoking them with additional options:
override_dh_auto_configure:
    dh_auto_configure -- --with-some-grog
This way, ./configure will be called with your custom flag but also with a lot of default flags like --prefix=/usr for better integration. In the initial memcached example, we overrode all those magic targets. dh_auto_clean, dh_auto_configure and dh_auto_build are converted to no-ops to avoid any unexpected behaviour. dh_auto_install is hijacked to do the whole build process. Additionally, we modified the behavior of the dh_gencontrol helper by forcing the version number instead of using the one from debian/changelog.

Automatic builds As memcached is an autoconf-enabled package, dh knows how to build it: ./configure && make && make install. Therefore, we can let it handle most of the work with this debian/rules file:
#!/usr/bin/make -f
DISTRIBUTION = $(shell lsb_release -sr)
VERSION = 1.4.25
PACKAGEVERSION = $(VERSION)-0~$(DISTRIBUTION)0
TARBALL = memcached-$(VERSION).tar.gz
URL = http://www.memcached.org/files/$(TARBALL)
%:
    dh $@ --with systemd
override_dh_auto_clean:
    wget -N --progress=dot:mega $(URL)
    tar --strip-components=1 -xf $(TARBALL)
override_dh_auto_test:
    # Don't run the whitespace test
    rm t/whitespace.t
    dh_auto_test
override_dh_gencontrol:
    dh_gencontrol -- -v$(PACKAGEVERSION)
The dh_auto_clean target is hijacked to download and set up the source tree9. We don't override the dh_auto_configure step, so dh will execute the ./configure script with the appropriate options. We don't override the dh_auto_build step either: dh will execute make. dh_auto_test is invoked after the build and it will run the memcached test suite. We need to override it because one of the tests complains about odd whitespace in the debian/ directory. We suppress this rogue test and let dh_auto_test execute the test suite. dh_auto_install is not overridden either, so dh will execute some variant of make install. To get a better sense of the difference, here is a diff:
--- memcached-intermediate/debian/rules 2016-04-30 14:02:37.425593362 +0200
+++ memcached/debian/rules  2016-05-01 14:55:15.815063835 +0200
@@ -12,10 +12,9 @@
 override_dh_auto_clean:
-override_dh_auto_test:
-override_dh_auto_build:
-override_dh_auto_install:
    wget -N --progress=dot:mega $(URL)
    tar --strip-components=1 -xf $(TARBALL)
-   ./configure --prefix=/usr
-   make
-   make install DESTDIR=debian/memcached
+
+override_dh_auto_test:
+   # Don't run the whitespace test
+   rm t/whitespace.t
+   dh_auto_test
It is up to you to decide if dh can do some work for you, but you could try to start from a minimal debian/rules and only override some targets.

Install additional files While make install installed the essential files for memcached, you may want to put additional files in the binary package. You could use cp in your build recipe, but you can also declare them:
  • files listed in debian/memcached.docs will be copied to /usr/share/doc/memcached by dh_installdocs,
  • files listed in debian/memcached.examples will be copied to /usr/share/doc/memcached/examples by dh_installexamples,
  • files listed in debian/memcached.manpages will be copied to the appropriate subdirectory of /usr/share/man by dh_installman.
Here is an example using wildcards for debian/memcached.docs:
doc/*.txt
If you need to copy some files to an arbitrary location, you can list them along with their destination directories in debian/memcached.install and dh_install will take care of the copy. Here is an example:
scripts/memcached-tool usr/bin
Using those files makes the build process more declarative. It is a matter of taste and you are free to use cp in debian/rules instead. You can review the whole package tree on GitHub.

Other examples The GitHub repository contains some additional examples. They all follow the same scheme:
  • dh_auto_clean is hijacked to download and setup the source tree
  • dh_gencontrol is modified to use a computed version
Notably, you'll find daemons in Java, Go, Python and Node.js. The goal of those examples is to demonstrate that using Debian tools to build Debian packages can be straightforward. Hope this helps.

  1. People may remember the time before debhelper 7.0.50 (circa 2009) when debian/rules was a daunting beast. However, nowadays, the boilerplate is quite reduced.
  2. The complexity is not the only reason. Those alternative tools enable the creation of RPM packages, something that Debian tools obviously don't.
  3. There are many ways to version a package. Again, if you want to be pragmatic, the proposed solution should be good enough for Ubuntu. On Debian, it doesn't cover upgrades from one distribution version to another, but we assume that, nowadays, systems get reinstalled instead of being upgraded.
  4. You also need to install the devscripts and equivs packages.
  5. It's also possible to use a script provided by upstream. However, there is no such thing as an init script that works on all distributions. Compare the proposed script with the skeleton, check if it is using start-stop-daemon and if it sources /lib/lsb/init-functions before considering it. If it seems to fit, you can install it yourself in debian/memcached/etc/init.d/. debhelper will ensure its proper integration.
  6. Instead, a user wanting to customize the options is expected to edit the unit with systemctl edit.
  7. See #822670
  8. The Debian Policy doesn't provide any hint on the naming convention for those system users. A common usage is to prefix the daemon name with an underscore (like _memcached). Another common usage is to use Debian- as a prefix. The main drawback of the latter solution is that the name is likely to be replaced by the UID in ps and top because of its length.
  9. We could call dh_auto_clean at the end of the target to let it invoke make clean. However, it is assumed that a fresh checkout is used before each build.

10 April 2016

Vincent Bernat: Testing network software with pytest and Linux namespaces

Started in 2008, lldpd is an implementation of IEEE 802.1AB-2005 (aka LLDP) written in C. While it contains some unit tests, like many other network-related software of that time, their coverage is pretty poor: they are hard to write because the code is written in an imperative style and tightly coupled with the system. Writing more would require extensive mocking1. While a rewrite (complete or iterative) would help to make the code more test-friendly, it would be quite an effort and it would likely introduce operational bugs along the way. To get better test coverage, the major features of lldpd are now verified through integration tests. Those tests leverage Linux network namespaces to set up a lightweight and isolated environment for each test. They run through pytest, a powerful testing tool.

pytest in a nutshell pytest is a Python testing tool whose primary use is to write tests for Python applications but is versatile enough for other creative usages. It is bundled with three killer features:
  • you can directly use the assert keyword,
  • you can inject fixtures in any test function, and
  • you can parametrize tests.

Assertions With unittest, the unit testing framework included with Python, and many similar frameworks, unit tests have to be encapsulated into a class and use the provided assertion methods. For example:
class testArithmetics(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 3, 4)
The equivalent with pytest is simpler and more readable:
def test_addition():
    assert 1 + 3 == 4
pytest will analyze the AST and display useful error messages in case of failure. For further information, see Benjamin Peterson's article.

Fixtures A fixture is the set of actions performed in order to prepare the system to run some tests. With classic frameworks, you can only define one fixture for a set of tests:
class testInVM(unittest.TestCase):
    def setUp(self):
        self.vm = VM('Test-VM')
        self.vm.start()
        self.ssh = SSHClient()
        self.ssh.connect(self.vm.public_ip)
    def tearDown(self):
        self.ssh.close()
        self.vm.destroy()
    def test_hello(self):
        stdin, stdout, stderr = self.ssh.exec_command("echo hello")
        stdin.close()
        self.assertEqual(stderr.read(), b"")
        self.assertEqual(stdout.read(), b"hello\n")
In the example above, we want to test various commands on a remote VM. The fixture launches a new VM and configures an SSH connection. However, if the SSH connection cannot be established, the fixture will fail and the tearDown() method won't be invoked. The VM will be left running. Instead, with pytest, we could do this:
@pytest.yield_fixture
def vm():
    r = VM('Test-VM')
    r.start()
    yield r
    r.destroy()
@pytest.yield_fixture
def ssh(vm):
    ssh = SSHClient()
    ssh.connect(vm.public_ip)
    yield ssh
    ssh.close()
def test_hello(ssh):
    stdin, stdout, stderr = ssh.exec_command("echo hello")
    stdin.close()
    stderr.read() == b""
    stdout.read() == b"hello\n"
The first fixture will provide a freshly booted VM. The second one will set up an SSH connection to the VM provided as an argument. Fixtures are used through dependency injection: just give their names in the signature of the test functions and fixtures that need them. Each fixture only handles the lifetime of one entity. Whether a dependent test function or fixture succeeds or fails, the VM will always be destroyed in the end.

Parameters If you want to run the same test several times with a varying parameter, you can dynamically create test functions or use one test function with a loop. With pytest, you can parametrize test functions and fixtures:
@pytest.mark.parametrize("n1, n2, expected", [
    (1, 3, 4),
    (8, 20, 28),
    (-4, 0, -4)])
def test_addition(n1, n2, expected):
    assert n1 + n2 == expected
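Fixtures can be parametrized as well: every test using such a fixture runs once per parameter. A small sketch (not from the lldpd test suite):
import pytest

@pytest.fixture(params=["eth0", "eth1"])
def interface(request):
    return request.param

def test_interface_name(interface):
    assert interface.startswith("eth")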

Testing lldpd The general plan to test a feature in lldpd is the following:
  1. Setup two namespaces.
  2. Create a virtual link between them.
  3. Spawn a lldpd process in each namespace.
  4. Test the feature in one namespace.
  5. Check with lldpcli that we get the expected result in the other.
Here is a typical test using the most interesting features of pytest:
@pytest.mark.skipif('LLDP-MED' not in pytest.config.lldpd.features,
                    reason="LLDP-MED not supported")
@pytest.mark.parametrize("classe, expected", [
    (1, "Generic Endpoint (Class I)"),
    (2, "Media Endpoint (Class II)"),
    (3, "Communication Device Endpoint (Class III)"),
    (4, "Network Connectivity Device")])
def test_med_devicetype(lldpd, lldpcli, namespaces, links,
                        classe, expected):
    links(namespaces(1), namespaces(2))
    with namespaces(1):
        lldpd("-r")
    with namespaces(2):
        lldpd("-M", str(classe))
    with namespaces(1):
        out = lldpcli("-f", "keyvalue", "show", "neighbors", "details")
        assert out['lldp.eth0.lldp-med.device-type'] == expected
First, the test will be executed only if lldpd was compiled with LLDP-MED support. Second, the test is parametrized. We will execute four distinct tests, one for each role that lldpd should be able to take as an LLDP-MED-enabled endpoint. The signature of the test has four parameters that are not covered by the parametrize() decorator: lldpd, lldpcli, namespaces and links. They are fixtures. A lot of magic happens in those to keep the actual tests short:
  • lldpd is a factory to spawn an instance of lldpd. When called, it will set up the current namespace (setting up the chroot, creating the user and group for privilege separation, replacing some files to be distribution-agnostic, …), then call lldpd with the additional parameters provided. The output is recorded and added to the test report in case of failure. The module also contains the creation of the pytest.config.lldpd object that is used to record the features supported by lldpd and skip non-matching tests. You can read fixtures/programs.py for more details.
  • lldpcli is also a factory, but it spawns instances of lldpcli, the client to query lldpd. Moreover, it will parse the output into a dictionary to reduce boilerplate.
  • namespaces is one of the most interesting pieces. It is a factory for Linux namespaces. It will spawn a new namespace or refer to an existing one. It is possible to switch from one namespace to another (with with) as they are contexts. Behind the scenes, the factory maintains the appropriate file descriptors for each namespace and switches to them with setns(); a minimal sketch of this mechanism follows this list. Once the test is done, everything is wiped out as the file descriptors are garbage collected. You can read fixtures/namespaces.py for more details. It is quite reusable in other projects2.
  • links contains helpers to handle network interfaces: creation of virtual Ethernet links between namespaces, creation of bridges, bonds and VLANs, etc. It relies on the pyroute2 module. You can read fixtures/network.py for more details.
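As hinted above for the namespaces fixture, here is a minimal, hypothetical sketch of the underlying mechanism: a context manager keeping a file descriptor on a network namespace and entering it with setns(). The NetNS name and the glibc lookup are assumptions for illustration; the real fixture in fixtures/namespaces.py is more general:
import ctypes
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)  # assumes glibc

class NetNS:
    def __init__(self, path):
        # Keep a descriptor on the target namespace,
        # e.g. /proc/<pid>/ns/net.
        self.target = os.open(path, os.O_RDONLY)
    def __enter__(self):
        # Remember the current namespace to restore it on exit.
        self.previous = os.open("/proc/self/ns/net", os.O_RDONLY)
        if libc.setns(self.target, 0) == -1:
            raise OSError(ctypes.get_errno(), "setns() failed")
    def __exit__(self, *exc):
        libc.setns(self.previous, 0)
        os.close(self.previous)
Entering a namespace this way requires the appropriate privileges, which is consistent with the root requirement mentioned in the footnotes.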
You can see an example of a test run on the Travis build for 0.9.2. Since each test is correctly isolated, it's possible to run parallel tests with pytest -n 10 --boxed. To catch even more bugs, both the address sanitizer (ASAN) and the undefined behavior sanitizer (UBSAN) are enabled. In case of a problem, notably a memory leak, the faulty program will exit with a non-zero exit code and the associated test will fail.

  1. A project like cwrap would definitely help. However, it lacks support for Netlink and raw sockets that are essential in lldpd operations.
  2. There are three main limitations in the use of namespaces with this fixture. First, when creating a user namespace, only root is mapped to the current user. With lldpd, we have two users (root and _lldpd). Therefore, the tests have to run as root. The second limitation is with the PID namespace. It's not possible for a process to switch from one PID namespace to another. When you call setns() on a PID namespace, only children of the current process will be in the new PID namespace. The PID namespace is convenient to ensure everyone gets killed once the tests are terminated, but keep in mind that /proc must be remounted in the children. The third limitation is that, for some namespaces (PID and user), all threads of a process must be part of the same namespace. Therefore, don't use threads in tests. Use the multiprocessing module instead.

3 January 2016

Lunar: Reproducible builds: week 35 in Stretch cycle

What happened in the reproducible builds effort between December 20th and December 26th: Toolchain fixes Mattia Rizzolo rebased our experimental versions of debhelper (twice!) and dpkg on top of the latest releases. Reiner Herrmann submitted a patch for mozilla-devscripts to sort the file list in generated preferences.js files. To be able to lift the restriction that packages must be built in the same path, translation support for the __FILE__ C pre-processor macro would also be required. Joerg Sonnenberger submitted a patch back in 2010 that would still be useful today. Chris Lamb started work on providing a deterministic mode for debootstrap. Packages fixed The following packages have become reproducible due to changes in their build dependencies: bouncycastle, cairo-dock-plug-ins, darktable, gshare, libgpod, pafy, ruby-redis-namespace, ruby-rouge, sparkleshare. The following packages became reproducible after getting fixed: Some uploads fixed some reproducibility issues, but not all of them: Patches submitted which have not made their way to the archive yet: reproducible.debian.net Statistics for package sets are now visible for the armhf architecture. (h01ger) The second build now has a longer timeout (18 hours) than the first build (12 hours). This should prevent wasting resources when a machine is loaded. (h01ger) Builds of Arch Linux packages are now done using a tmpfs. (h01ger) 200 GiB have been added to jenkins.debian.net (thanks to ProfitBricks!) to make room for new jobs. The current count is at 962 and growing! diffoscope development Aside from some minor bugs that have been fixed, a one-line change made huge memory (and time) savings as the output of the transformation tool is now streamed line by line instead of loaded entirely in memory at once. disorderfs development Andrew Ayer released disorderfs version 0.4.2-1 on December 22nd. It fixes a memory corruption error when processing command line arguments that could cause command line options to be ignored. Documentation update Many small improvements for the documentation on reproducible-builds.org sent by Georg Koppen were merged. Package reviews 666 (!) reviews have been removed, 189 added and 162 updated in the previous week. 151 new "fail to build from source" reports have been made by Chris West, Chris Lamb, Mattia Rizzolo, and Niko Tyni. New issues identified: unsorted_filelist_in_xul_ext_preferences, nondeterminstic_output_generated_by_moarvm. Misc. Steven Chamberlain drew our attention to one analysis of the Juniper ScreenOS Authentication Backdoor: Whilst this may have been added in source code, it was well-disguised in the disassembly and just 7 instructions long. I thought this was a good example of the current state-of-the-art, and why we'd like our binaries and eventually, installer and VM images reproducible IMHO. Joanna Rutkowska has mentioned possible ways for Qubes to become reproducible on their development mailing list.

1 September 2015

Lunar: Reproducible builds: week 18 in Stretch cycle

What happened in the reproducible builds effort this week: Toolchain fixes Aurélien Jarno uploaded glibc/2.21-0experimental1 which will fix the issue where locales-all did not behave exactly like locales despite having it in the Provides field. Lunar rebased the pu/reproducible_builds branch for dpkg on top of the released 1.18.2. This made visible an issue with udebs and automatically generated debug packages. The summary from the meeting at DebConf15 between ftpmasters, dpkg maintainers and reproducible builds folks has been posted to the relevant mailing lists. Packages fixed The following 70 packages became reproducible due to changes in their build dependencies: activemq-activeio, async-http-client, classworlds, clirr, compress-lzf, dbus-c++, felix-bundlerepository, felix-framework, felix-gogo-command, felix-gogo-runtime, felix-gogo-shell, felix-main, felix-shell-tui, felix-shell, findbugs-bcel, gco, gdebi, gecode, geronimo-ejb-3.2-spec, git-repair, gmetric4j, gs-collections, hawtbuf, hawtdispatch, jack-tools, jackson-dataformat-cbor, jackson-dataformat-yaml, jackson-module-jaxb-annotations, jmxetric, json-simple, kryo-serializers, lhapdf, libccrtp, libclaw, libcommoncpp2, libftdi1, libjboss-marshalling-java, libmimic, libphysfs, libxstream-java, limereg, maven-debian-helper, maven-filtering, maven-invoker, mochiweb, mongo-java-driver, mqtt-client, netty-3.9, openhft-chronicle-queue, openhft-compiler, openhft-lang, pavucontrol, plexus-ant-factory, plexus-archiver, plexus-bsh-factory, plexus-cdc, plexus-classworlds2, plexus-component-metadata, plexus-container-default, plexus-io, pytone, scolasync, sisu-ioc, snappy-java, spatial4j-0.4, tika, treeline, wss4j, xtalk, zshdb. The following packages became reproducible after getting fixed: Some uploads fixed some reproducibility issues but not all of them: Patches submitted which have not made their way to the archive yet: Chris Lamb also noticed that binaries shipped with libsilo-bin did not work. Documentation update Chris Lamb and Ximin Luo assembled a proper specification for SOURCE_DATE_EPOCH in the hope of convincing more upstreams to adopt it. Thanks to Holger, it is published under a non-Debian domain name. Lunar documented the easiest way to solve issues with file ordering and timestamps in tarballs that came with tar/1.28-1. Some examples on how to use SOURCE_DATE_EPOCH have been improved to support systems without GNU date. reproducible.debian.net armhf is finally being tested, which also means the remote building of Debian packages finally works! This paves the way to performing the tests on even more architectures and doing variations on CPU and date. Some packages even produce the same binary Arch:all packages on different architectures (1, 2). (h01ger) Tests for FreeBSD are finally running. (h01ger) As it seems the gcc5 transition has cooled off, we schedule sid more often than testing again on amd64. (h01ger) disorderfs has been built and installed on all build nodes (amd64 and armhf). One issue related to permissions for root and unprivileged users needs to be solved before disorderfs can be used on reproducible.debian.net. (h01ger) strip-nondeterminism Version 0.011-1 has been released on August 29th. The new version updates dh_strip_nondeterminism to match recent changes in debhelper. (Andrew Ayer) disorderfs disorderfs, the new FUSE filesystem to ease testing of filesystem-related variations, is now almost ready to be used. Version 0.2.0 adds support for extended attributes. 
Since then, Andrew Ayer has also added support for reversing directory entries instead of shuffling them, and for adding arbitrary padding to the number of blocks used by files. Package reviews 142 reviews have been removed, 48 added and 259 updated this week. Santiago Vila renamed the not_using_dh_builddeb issue to varying_mtimes_in_data_tar_gz_or_control_tar_gz to align better with other tag names. New issue identified this week: random_order_in_python_doit_completion. 37 FTBFS issues have been reported by Chris West (Faux) and Chris Lamb. Misc. h01ger gave a talk at FrOSCon on August 23rd. Recordings are already online. These reports are being reviewed and enhanced every week by many people hanging out on #debian-reproducible. Huge thanks!

25 August 2015

Lunar: Reproducible builds: week 17 in Stretch cycle

A good amount of the Debian reproducible builds team had the chance to enjoy face-to-face interactions during DebConf15.
Names in red and blue were all present at DebConf15
Picture of the "reproducible builds" talk during DebConf15
Hugging people with whom one has been working tirelessly for months gives a lot of warm-fuzzy feelings. Several recorded and hallway discussions paved the way to solve the remaining issues to get reproducible builds part of Debian proper. Both talks from the Debian Project Leader and the release team mentioned the effort as important for the future of Debian. A forty-five minute talk presented the state of the reproducible builds effort. It was then followed by an hour-long roundtable to discuss current blockers regarding dpkg, .buildinfo and their integration in the archive. Picture of the "reproducible builds" roundtable during DebConf15 Toolchain fixes Reiner Herrmann submitted a patch to make rdfind sort the processed files before doing any operation. Chris Lamb proposed a new patch for wheel implementing support for SOURCE_DATE_EPOCH instead of the custom WHEEL_FORCE_TIMESTAMP. akira sent one making man2html SOURCE_DATE_EPOCH aware. Stéphane Glondu reported that dpkg-source would not respect tarball permissions when unpacking under a umask of 002. After hours of iterative testing during the DebConf workshop, Sandro Knauß created a test case showing how pdflatex output can be non-deterministic with some PNG files. Packages fixed The following 65 packages became reproducible due to changes in their build dependencies: alacarte, arbtt, bullet, ccfits, commons-daemon, crack-attack, d-conf, ejabberd-contrib, erlang-bear, erlang-cherly, erlang-cowlib, erlang-folsom, erlang-goldrush, erlang-ibrowse, erlang-jiffy, erlang-lager, erlang-lhttpc, erlang-meck, erlang-p1-cache-tab, erlang-p1-iconv, erlang-p1-logger, erlang-p1-mysql, erlang-p1-pam, erlang-p1-pgsql, erlang-p1-sip, erlang-p1-stringprep, erlang-p1-stun, erlang-p1-tls, erlang-p1-utils, erlang-p1-xml, erlang-p1-yaml, erlang-p1-zlib, erlang-ranch, erlang-redis-client, erlang-uuid, freecontact, givaro, glade, gnome-shell, gupnp, gvfs, htseq, jags, jana, knot, libconfig, libkolab, libmatio, libvsqlitepp, mpmath, octave-zenity, openigtlink, paman, pisa, pynifti, qof, ruby-blankslate, ruby-xml-simple, timingframework, trace-cmd, tsung, wings3d, xdg-user-dirs, xz-utils, zpspell. The following packages became reproducible after getting fixed: Uploads that might have fixed reproducibility issues: Some uploads fixed some reproducibility issues but not all of them: Patches submitted which have not made their way to the archive yet: Stéphane Glondu reported two issues regarding embedded build dates in omake and cduce. Aurélien Jarno submitted a fix for the breakage of the make-dfsg test suite. As binutils now creates deterministic libraries by default, Aurélien's patch makes use of a wrapper to give the U flag to ar. Reiner Herrmann reported an issue with pound which embeds random dhparams in its code during the build. Better solutions are yet to be found. reproducible.debian.net Package pages on reproducible.debian.net now have a new layout improving readability, designed by Mattia Rizzolo, h01ger, and Ulrike. The navigation is now on the left as vertical space is more valuable nowadays. armhf is now enabled on all pages except the dashboard. Actual tests on armhf are expected to start shortly. (Mattia Rizzolo, h01ger) The limit on how many packages people can schedule using the reschedule script on Alioth has been bumped to 200. (h01ger) mod_rewrite is now used instead of JavaScript for the form in the dashboard. 
(h01ger) Following the rename of the software, debbindiff has mostly been replaced by either diffoscope or differences in generated HTML and IRC notification output. Connections to UDD have been made more robust. (Mattia Rizzolo) diffoscope development diffoscope version 31 was released on August 21st. This version improves fuzzy-matching by using the tlsh algorithm instead of ssdeep. New command line options are available: --max-diff-input-lines and --max-diff-block-lines to override limits on diff input and output (Reiner Herrmann), --debugger to dump the user into pdb in case of crashes (Mattia Rizzolo). jar archives should now be detected properly (Reiner Herrmann). Several general code cleanups were also done by Chris Lamb. strip-nondeterminism development Andrew Ayer released strip-nondeterminism version 0.010-1. Java properties files in jars should now be detected more accurately. A missing dependency spotted by Stéphane Glondu has been added. Testing directory ordering issues: disorderfs During the reproducible builds workshop at DebConf, participants identified that we were still short of a good way to test variations on filesystem behaviors (e.g. file ordering or disk usage). Andrew Ayer took a couple of hours to create disorderfs. Based on FUSE, disorderfs is an overlay filesystem that will mount the content of a directory at another location. For this first version, it will make the order in which files appear in a directory random. Documentation update Dhole documented how to implement support for SOURCE_DATE_EPOCH in Python, bash, Makefiles, CMake, and C. Chris Lamb started to convert the wiki page describing SOURCE_DATE_EPOCH into a Freedesktop-like specification in the hope that it will convince more upstreams to adopt it. Package reviews 44 reviews have been removed, 192 added and 77 updated this week. New issues identified this week: locale_dependent_order_in_devlibs_depends, randomness_in_ocaml_startup_files, randomness_in_ocaml_packed_libraries, randomness_in_ocaml_custom_executables, undeterministic_symlinking_by_rdfind, random_build_path_by_golang_compiler, and images_in_pdf_generated_by_latex. 117 new FTBFS bugs have been reported by Chris Lamb, Chris West (Faux), and Niko Tyni. Misc. Some reproducibility issues might face us very late. Chris Lamb noticed that the test suite for python-pykmip was now failing because its test certificates have expired. Let's hope no packages are hiding a certificate valid for 10 years somewhere in their source! Pictures courtesy and copyright of Debian's own paparazzi: Aigars Mahinovs.

20 June 2015

Lunar: Reproducible builds: week 5 in Stretch cycle

What happened in the reproducible builds effort this week: Toolchain fixes Uploads that should help other packages: Patch submitted for toolchain issues: Some discussions have been started in Debian and with upstream: Packages fixed The following 8 packages became reproducible due to changes in their build dependencies: access-modifier-checker, apache-log4j2, jenkins-xstream, libsdl-perl, maven-shared-incremental, ruby-pygments.rb, ruby-wikicloth, uimaj. The following packages became reproducible after getting fixed: Some uploads fixed some reproducibility issues but not all of them: Patches submitted which did not make their way to the archive yet: Discussions that have been started: reproducible.debian.net Holger Levsen added two new package sets: pkg-javascript-devel and pkg-php-pear. The list of packages with and without notes is now sorted by age of the latest build. Mattia Rizzolo added support for email notifications so that maintainers can be warned when a package becomes unreproducible. Please ask Mattia or Holger or in the #debian-reproducible IRC channel if you want to be notified for your packages! strip-nondeterminism development Andrew Ayer fixed the gzip handler so that it skips adding a predetermined timestamp when there was none. Documentation update Lunar added documentation about mtimes of files extracted using unzip being timezone dependent. He also wrote a short example on how to test reproducibility. Stephen Kitt updated the documentation about timestamps in PE binaries. Documentation and scripts to perform weekly reports were published by Lunar. Package reviews 50 obsolete reviews have been removed, 51 added and 29 updated this week. Thanks Chris West and Mathieu Bridon amongst others. Newly identified issues: Misc. Lunar will be talking (in French) about reproducible builds at Pas Sage en Seine on June 19th, at 15:00 in Paris. Meeting will happen this Wednesday, 19:00 UTC.

27 May 2015

Vincent Bernat: Live patching QEMU for VENOM mitigation

CVE-2015-3456, also known as VENOM, is a security vulnerability in QEMU virtual floppy controller:
The Floppy Disk Controller (FDC) in QEMU, as used in Xen […] and KVM, allows local guest users to cause a denial of service (out-of-bounds write and guest crash) or possibly execute arbitrary code via the FD_CMD_READ_ID, FD_CMD_DRIVE_SPECIFICATION_COMMAND, or other unspecified commands.
Even when QEMU has been configured with no floppy drive, the floppy controller code is still active. The vulnerability is easy to test1:
#include <stddef.h>
#include <sys/io.h>

#define FDC_IOPORT 0x3f5
#define FD_CMD_READ_ID 0x0a

int main() {
    ioperm(FDC_IOPORT, 1, 1);
    outb(FD_CMD_READ_ID, FDC_IOPORT);
    for (size_t i = 0;; i++)
        outb(0x42, FDC_IOPORT);
    return 0;
}
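Assuming the snippet above is saved as venom.c (an arbitrary name), it can be compiled and run as root inside a guest; a vulnerable QEMU process on the host is then expected to crash:
$ gcc -o venom venom.c
$ sudo ./venom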
Once the fix is installed, all processes still have to be restarted for the upgrade to be effective. It is possible to minimize the downtime by leveraging virsh save. Another possibility would be to patch the running processes. The Linux kernel attracted a lot of interest in this area, with solutions like Ksplice (mostly killed by Oracle), kGraft (by SUSE) and kpatch (by Red Hat) and the inclusion of a common framework in the kernel. The userspace has far fewer out-of-the-box solutions2. I present here a simple and self-contained way to patch a running QEMU to remove the vulnerability without requiring any noticeable downtime. Here is a short demonstration:

Proof of concept First, let's find a workaround that would be simple to implement through live patching: while modifying the code of a running process is possible, it is easier to modify a single variable.

Concept Looking at the code of the floppy controller and the patch, we can avoid the vulnerability by not accepting any command on the FIFO port. Each request would be answered with "Invalid command" (0x80) and a user won't be able to push more bytes to the FIFO until the answer is read and the FIFO queue is reset. Of course, the floppy controller would be rendered useless in this state. But who cares? The list of commands accepted by the controller on the FIFO port is contained in the handlers[] array:
static const struct {
    uint8_t value;
    uint8_t mask;
    const char* name;
    int parameters;
    void (*handler)(FDCtrl *fdctrl, int direction);
    int direction;
} handlers[] = {
    { FD_CMD_READ, 0x1f, "READ", 8, fdctrl_start_transfer, FD_DIR_READ },
    { FD_CMD_WRITE, 0x3f, "WRITE", 8, fdctrl_start_transfer, FD_DIR_WRITE },
    /* [...] */
    { 0, 0, "unknown", 0, fdctrl_unimplemented }, /* default handler */
};
To avoid browsing the array each time a command is received, another array is used to map each command to the appropriate handler:
/* Associate command to an index in the 'handlers' array */
static uint8_t command_to_handler[256];
static void fdctrl_realize_common(FDCtrl *fdctrl, Error **errp)
{
    int i, j;
    static int command_tables_inited = 0;
    /* Fill 'command_to_handler' lookup table */
    if (!command_tables_inited) {
        command_tables_inited = 1;
        for (i = ARRAY_SIZE(handlers) - 1; i >= 0; i--) {
            for (j = 0; j < sizeof(command_to_handler); j++) {
                if ((j & handlers[i].mask) == handlers[i].value) {
                    command_to_handler[j] = i;
                }
            }
        }
    }
    /* [...] */
}
Our workaround is to modify the command_to_handler[] array to map all commands to the fdctrl_unimplemented() handler (the last one in the handlers[] array).

Testing with gdb To check if the workaround works as expected, we test it with gdb. Unless you have compiled QEMU yourself, you need to install a package with debug symbols. Unfortunately, on Debian, they are not available yet3. On Ubuntu, you can install the qemu-system-x86-dbgsym package after enabling the appropriate repositories. The following function for gdb maps every command to the unimplemented handler:
define patch
  set $handler = sizeof(handlers)/sizeof(*handlers)-1
  set $i = 0
  while ($i < 256)
   set variable command_to_handler[$i++] = $handler
  end
  printf "Done!\n"
end
Attach to the vulnerable process (with attach), call the function (with patch) and detach from the process (with detach). You can check that the exploit is not working anymore. This could easily be automated.

Limitations Using gdb has two main limitations:
  1. It needs to be installed on each host to be patched.
  2. The debug packages need to be installed as well. Moreover, it can be difficult to fetch previous versions of those packages.

Writing a custom patcher To overcome those limitations, we can write a custom patcher using the ptrace() system call, without relying on debug symbols being present.

Finding the right memory spot Before being able to modify the command_to_handler[] array, we need to know its location. The first clue is given by the symbol table. To query it, use readelf -s:
$ readelf -s /usr/lib/debug/.build-id/09/95121eb46e2a4c13747ac2bad982829365c694.debug | \
>   sed -n -e 1,3p -e /command_to_handler/p
Symbol table '.symtab' contains 27066 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
  8485: 00000000009f9d00   256 OBJECT  LOCAL  DEFAULT   26 command_to_handler
This table is usually stripped out of the executable to save space, as shown below:
$ file -b /usr/bin/qemu-system-x86_64 | tr , \\n
ELF 64-bit LSB shared object
 x86-64
 version 1 (SYSV)
 dynamically linked
 interpreter /lib64/ld-linux-x86-64.so.2
 for GNU/Linux 2.6.32
 BuildID[sha1]=0995121eb46e2a4c13747ac2bad982829365c694
 stripped
If your distribution provides a debug package, the debug symbols are installed in /usr/lib/debug. Most modern distributions are now relying on the build ID4 to map an executable to its debugging symbols, like the example above. Without a debug package, you need to recompile the existing package without stripping debug symbols in a clean environment5. On Debian, this can be done by setting the DEB_BUILD_OPTIONS environment variable to nostrip. We now have two possible cases:
  • the easy one, and
  • the hard one.

The easy case On x86, here is the standard layout of a regular Linux process in memory6: Memory layout of a regular process on x86 The random gaps (ASLR) are here to prevent an attacker from reliably jumping to a particular exploited function in memory. On x86-64, the layout is quite similar. The important point is that the base address of the executable is fixed. The memory mapping of a process is also available through /proc/PID/maps. Here is a shortened and annotated example on x86-64:
$ cat /proc/3609/maps
00400000-00401000         r-xp 00000000 fd:04 483  not-qemu [text segment]
00601000-00602000         r--p 00001000 fd:04 483  not-qemu [data segment]
00602000-00603000         rw-p 00002000 fd:04 483  not-qemu [BSS segment]
[random gap]
02419000-0293d000         rw-p 00000000 00:00 0    [heap]
[random gap]
7f0835543000-7f08356e2000 r-xp 00000000 fd:01 9319 /lib/x86_64-linux-gnu/libc-2.19.so
7f08356e2000-7f08358e2000 ---p 0019f000 fd:01 9319 /lib/x86_64-linux-gnu/libc-2.19.so
7f08358e2000-7f08358e6000 r--p 0019f000 fd:01 9319 /lib/x86_64-linux-gnu/libc-2.19.so
7f08358e6000-7f08358e8000 rw-p 001a3000 fd:01 9319 /lib/x86_64-linux-gnu/libc-2.19.so
7f08358e8000-7f08358ec000 rw-p 00000000 00:00 0
7f08358ec000-7f083590c000 r-xp 00000000 fd:01 5138 /lib/x86_64-linux-gnu/ld-2.19.so
7f0835aca000-7f0835acd000 rw-p 00000000 00:00 0
7f0835b08000-7f0835b0c000 rw-p 00000000 00:00 0
7f0835b0c000-7f0835b0d000 r--p 00020000 fd:01 5138 /lib/x86_64-linux-gnu/ld-2.19.so
7f0835b0d000-7f0835b0e000 rw-p 00021000 fd:01 5138 /lib/x86_64-linux-gnu/ld-2.19.so
7f0835b0e000-7f0835b0f000 rw-p 00000000 00:00 0
[random gap]
7ffdb0f85000-7ffdb0fa6000 rw-p 00000000 00:00 0    [stack]
With a regular executable, the value given in the symbol table is an absolute memory address:
$ readelf -s not-qemu | \
>   sed -n -e 1,3p -e /command_to_handler/p
Symbol table '.dynsym' contains 9 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
    47: 0000000000602080   256 OBJECT  LOCAL  DEFAULT   25 command_to_handler
So, the address of command_to_handler[], in the above example, is just 0x602080.

The hard case To enhance security, it is possible to load some executables at a random base address, just like a library. Such an executable is called a Position Independent Executable (PIE). An attacker won't be able to rely on a fixed address to find some helpful function. Here is the new memory layout: Memory layout of a PIE process on x86 With a PIE process, the value in the symbol table is now an offset from the base address.
$ readelf -s not-qemu-pie | sed -n -e 1,3p -e /command_to_handler/p
Symbol table '.dynsym' contains 17 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
    47: 0000000000202080   256 OBJECT  LOCAL  DEFAULT   25 command_to_handler
If we look at /proc/PID/maps, we can figure out where the array is located in memory:
$ cat /proc/12593/maps
7f6c13565000-7f6c13704000 r-xp 00000000 fd:01 9319  /lib/x86_64-linux-gnu/libc-2.19.so
7f6c13704000-7f6c13904000 ---p 0019f000 fd:01 9319  /lib/x86_64-linux-gnu/libc-2.19.so
7f6c13904000-7f6c13908000 r--p 0019f000 fd:01 9319  /lib/x86_64-linux-gnu/libc-2.19.so
7f6c13908000-7f6c1390a000 rw-p 001a3000 fd:01 9319  /lib/x86_64-linux-gnu/libc-2.19.so
7f6c1390a000-7f6c1390e000 rw-p 00000000 00:00 0
7f6c1390e000-7f6c1392e000 r-xp 00000000 fd:01 5138  /lib/x86_64-linux-gnu/ld-2.19.so
7f6c13b2e000-7f6c13b2f000 r--p 00020000 fd:01 5138  /lib/x86_64-linux-gnu/ld-2.19.so
7f6c13b2f000-7f6c13b30000 rw-p 00021000 fd:01 5138  /lib/x86_64-linux-gnu/ld-2.19.so
7f6c13b30000-7f6c13b31000 rw-p 00000000 00:00 0
7f6c13b31000-7f6c13b33000 r-xp 00000000 fd:04 4594  not-qemu-pie [text segment]
7f6c13cf0000-7f6c13cf3000 rw-p 00000000 00:00 0
7f6c13d2e000-7f6c13d32000 rw-p 00000000 00:00 0
7f6c13d32000-7f6c13d33000 r--p 00001000 fd:04 4594  not-qemu-pie [data segment]
7f6c13d33000-7f6c13d34000 rw-p 00002000 fd:04 4594  not-qemu-pie [BSS segment]
[random gap]
7f6c15c46000-7f6c15c67000 rw-p 00000000 00:00 0     [heap]
[random gap]
7ffe823b0000-7ffe823d1000 rw-p 00000000 00:00 0     [stack]
The base address is 0x7f6c13b31000, the offset is 0x202080 and therefore, the location of the array is 0x7f6c13d33080. We can check with gdb:
$ print &command_to_handler
$1 = (uint8_t (*)[256]) 0x7f6c13d33080 <command_to_handler>
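The same computation is easy to script without gdb. Here is a hypothetical Python sketch for the PIE case: it takes the first mapping of the executable in /proc/PID/maps as the base address and adds the offset found in the symbol table (the PID, offset and file name are those of the example above):
def symbol_address(pid, offset, name):
    # The first mapping of the executable (its text segment)
    # gives the base address.
    with open("/proc/%d/maps" % pid) as maps:
        for line in maps:
            fields = line.split()
            if len(fields) >= 6 and fields[5].endswith(name):
                base = int(fields[0].split("-")[0], 16)
                return base + offset
    raise LookupError("executable not found in maps")

print(hex(symbol_address(12593, 0x202080, "not-qemu-pie")))  # 0x7f6c13d33080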

Patching a memory spot Once we know the location of the command_to_handler[] array in memory, patching it is quite straightforward. First, we start tracing the target process:
/* Attach to the running process */
static int
patch_attach(pid_t pid)
{
    int status;
    printf("[.] Attaching to PID %d...\n", pid);
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
        fprintf(stderr, "[!] Unable to attach to PID %d: %m\n", pid);
        return -1;
    }
    if (waitpid(pid, &status, 0) == -1) {
        fprintf(stderr, "[!] Error while attaching to PID %d: %m\n", pid);
        return -1;
    }
    assert(WIFSTOPPED(status)); /* Tracee may have died */
    if (ptrace(PTRACE_GETSIGINFO, pid, NULL, &si) == -1) {
        fprintf(stderr, "[!] Unable to read siginfo for PID %d: %m\n", pid);
        return -1;
    }
    assert(si.si_signo == SIGSTOP); /* Other signals may have been received */
    printf("[*] Successfully attached to PID %d\n", pid);
    return 0;
}
Then, we retrieve the command_to_handler[] array, modify it and put it back in memory7.
static int
patch_doit(pid_t pid, unsigned char *target)
{
    int ret = -1;
    unsigned char *command_to_handler = NULL;
    size_t i;
    /* Get the table */
    printf("[.] Retrieving command_to_handler table...\n");
    command_to_handler = ptrace_read(pid,
                                     target,
                                     QEMU_COMMAND_TO_HANDLER_SIZE);
    if (command_to_handler == NULL) {
        fprintf(stderr, "[!] Unable to read command_to_handler table: %m\n");
        goto out;
    }
    /* Check if the table has already been patched. */
    /* [...] */
    /* Patch it */
    printf("[.] Patching QEMU...\n");
    for (i = 0; i < QEMU_COMMAND_TO_HANDLER_SIZE; i++) {
        command_to_handler[i] = QEMU_NOT_IMPLEMENTED_HANDLER;
    }
    if (ptrace_write(pid, target, command_to_handler,
           QEMU_COMMAND_TO_HANDLER_SIZE) == -1) {
        fprintf(stderr, "[!] Unable to patch command_to_handler table: %m\n");
        goto out;
    }
    printf("[*] QEMU successfully patched!\n");
    ret = 0;
out:
    free(command_to_handler);
    return ret;
}
Since ptrace() only allows reading or writing one word at a time, ptrace_read() and ptrace_write() are wrappers to read or write arbitrarily large chunks of memory8. Here is the code for ptrace_read():
/* Read memory of the given process */
static void *
ptrace_read(pid_t pid, void *address, size_t size)
{
    /* Allocate the buffer */
    uword_t *buffer = malloc((size/sizeof(uword_t) + 1)*sizeof(uword_t));
    if (!buffer) return NULL;
    /* Read word by word */
    size_t readsz = 0;
    do {
        errno = 0;
        if ((buffer[readsz/sizeof(uword_t)] =
                ptrace(PTRACE_PEEKTEXT, pid,
                       (unsigned char*)address + readsz,
                       0)) && errno) {
            fprintf(stderr, "[!] Unable to peek one word at address %p: %m\n",
                    (unsigned char *)address + readsz);
            free(buffer);
            return NULL;
        }
        readsz += sizeof(uword_t);
    } while (readsz < size);
    return (unsigned char *)buffer;
}

Putting the pieces together The patcher is provided with the following information:
  • the PID of the process to be patched,
  • the command_to_handler[] offset from the symbol table, and
  • the build ID of the executable file used to get this offset (as a safety measure).
The main steps are:
  1. Attach to the process with ptrace().
  2. Get the executable name from /proc/PID/exe.
  3. Parse /proc/PID/maps to find the address of the text segment (it's the first one).
  4. Do some sanity checks:
    • check there is an ELF header at this location (4-byte magic number),
    • check the executable type (ET_EXEC for regular executables, ET_DYN for PIE), and
    • get the build ID and compare with the expected one.
  5. From the base address and the provided offset, compute the location of the command_to_handler[] array.
  6. Patch it.
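As a side note on step 4, the expected build ID can be dumped beforehand from the ELF notes, for example with readelf (the path and ID below are the ones used in this article; output shortened):
$ readelf -n /usr/bin/qemu-system-x86_64 | grep "Build ID"
    Build ID: 0995121eb46e2a4c13747ac2bad982829365c694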
You can find the complete patcher on GitHub.
$ ./patch --build-id 0995121eb46e2a4c13747ac2bad982829365c694 \
>         --offset 9f9d00 \
>         --pid 16833
[.] Attaching to PID 16833...
[*] Successfully attached to PID 16833
[*] Executable name is /usr/bin/qemu-system-x86_64
[*] Base address is 0x7f7eea912000
[*] Both build IDs match
[.] Retrieving command_to_handler table...
[.] Patching QEMU...
[*] QEMU successfully patched!

  1. The complete code for this test is on GitHub.
  2. An interesting project seems to be Katana. But there are also some insightful hacking papers on the subject.
  3. Some packages come with a -dbg package with debug symbols, some others don't. Fortunately, a proposal to automatically produce debugging symbols for everything is near completion.
  4. The Fedora Wiki contains the rationale behind the build ID.
  5. If the build is incorrectly reproduced, the build ID won t match. The information provided by the debug symbols may or may not be correct. Debian currently has a reproducible builds effort to ensure that each package can be reproduced.
  6. Anatomy of a program in memory is a great blog post explaining in more details how a program lives in memory.
  7. Being an uninitialized static variable, the variable is in the BSS section. This section is mapped to a writable memory segment. If it wasn't the case, with Linux, the ptrace() system call would still be allowed to write: Linux will copy the page and mark it as private.
  8. With Linux 3.2 or later, process_vm_readv() and process_vm_writev() can be used to transfer data from/to a remote process without using ptrace() at all. However, ptrace() would still be needed to reliably stop the main thread.

4 February 2015

Vincent Bernat: Directory bookmarks with Zsh

There are numerous projects to implement directory bookmarks in your favorite shell. An inherent limitation of those implementations is that they are only an enhanced cd command: you cannot use a bookmark in an arbitrary command. UPDATED: My initial implementation with Zsh was using dynamic named directories. It was pointed out to me on Twitter that there is a simpler way to implement bookmarks. The article has been updated to reflect that. As a side note, it is also possible to just use shell variables1. Zsh comes with a little-known feature called static named directories. They are declared with the hash builtin and can be referred to by prepending ~ to their name:
$ hash -d -- -lldpd=/home/bernat/code/deezer/lldpd
$ echo ~-lldpd/README.md
/home/bernat/code/deezer/lldpd/README.md
$ head -n1 ~-lldpd/README.md
lldpd: implementation of IEEE 802.1ab (LLDP)
Because ~-lldpd is substituted during file name expansion, it is possible to use it in any command like a regular directory, as shown above. The - prefix is only here to avoid collisions with home directories. Bookmarks are kept in a dedicated directory, $MARKPATH. Each bookmark is a symbolic link to the target directory: for example, ~-lldpd should be expanded to $MARKPATH/lldpd which points to the appropriate directory. Assuming that you have populated $MARKPATH with some links, here is how bookmarks are registered:
for link ($MARKPATH/*(N@)) {
    hash -d -- -${link:t}=${link:A}
}
You also automatically get completion and prompt expansion:
$ pwd
/home/bernat/code/deezer/lldpd/src/lib
$ echo ${(%):-%~}
~-lldpd/src/lib
The last step is to manage bookmarks without adding or removing symbolic links manually. The following bookmark() function will display the existing bookmarks when called without arguments, will remove a bookmark when called with -d or add the current directory as a bookmark otherwise.
bookmark() {
    if (( $# == 0 )); then
        # When no arguments are provided, just display existing
        # bookmarks
        for link in $MARKPATH/*(N@); do
            local markname="$fg[green]${link:t}$reset_color"
            local markpath="$fg[blue]${link:A}$reset_color"
            printf "%-30s -> %s\n" $markname $markpath
        done
    else
        # Otherwise, we may want to add a bookmark or delete an
        # existing one.
        local -a delete
        zparseopts -D d=delete
        if (( $+delete[1] )); then
            # With "-d", we delete an existing bookmark
            command rm $MARKPATH/$1
        else
            # Otherwise, add a bookmark to the current
            # directory. The first argument is the bookmark
            # name. "." is special and means the bookmark should
            # be named after the current directory.
            local name=$1
            [[ $name == "." ]] && name=${PWD:t}
            ln -s $PWD $MARKPATH/$name
        fi
    fi
}
Find the complete version on GitHub.

Dynamic named directories Another (more complex) way to achieve the same thing is to use dynamic named directories. I was initially using this solution but it is far more complex. This section is only kept for historical reasons. You can find the complete implementation in my GitHub repository. During file name expansion, a ~ followed by a string in square brackets is provided to the zsh_directory_name() function which will eventually reply with a directory name. This feature can be used to implement directory bookmarks:
$ cd ~[@lldpd]
$ pwd
/home/bernat/code/deezer/lldpd
$ echo ~[@lldpd]/README.md
/home/bernat/code/deezer/lldpd/README.md
$ head -n1 ~[@lldpd]/README.md
lldpd: implementation of IEEE 802.1ab (LLDP)
Like for static named directories, because ~[@lldpd] is substituted during file name expansion, it is possible to use it in any command like a regular directory.

Basic implementation Bookmarks are still kept into a dedicated directory, $MARKPATH and are still symbolic links. Here is how the core feature is implemented:
_bookmark_directory_name() {
    emulate -L zsh # ❶
    setopt extendedglob
    case $1 in
        n)
            [[ $2 != (#b)"@"(?*) ]] && return 1 # ❷
            typeset -ga reply
            reply=(${${:-$MARKPATH/$match[1]}:A}) # ❸
            return 0
            ;;
        *)
            return 1
            ;;
    esac
    return 0
}
add-zsh-hook zsh_directory_name _bookmark_directory_name
zsh_directory_name() is a function accepting hooks2: instead of defining it directly, we define another function and register it as a hook with add-zsh-hook. The hook is expected to handle different situations. The first one is to be able to transform a dynamic name into a regular directory name. In this case, the first parameter of the function is n and the second one is the dynamic name. In ❶, the call to emulate will restore the pristine behaviour of Zsh and also ensure that any option set in the scope of the function will not have an impact outside. The function can then be reused safely in another environment. In ❷, we check that the dynamic name starts with @ followed by at least one character. Otherwise, we declare we don't know how to handle it. Another hook will get the chance to do something. (#b) is a globbing flag: it activates backreferences for parenthesised groups. When a match is found, it is stored in the $match array. In ❸, we build the reply. We could have just returned $MARKPATH/$match[1] but to hide the symbolic link mechanism, we use the A modifier to ask Zsh to resolve symbolic links if possible. Zsh allows nested substitutions, so it is possible to use modifiers and flags on anything. ${:-$MARKPATH/$match[1]} is a common trick to turn $MARKPATH/$match[1] into a parameter substitution and be able to apply the A modifier on it.

Completion Zsh is also able to ask for completion of a dynamic directory name. In this case, the completion system calls the hook function with c as the first argument.
_bookmark_directory_name() {
    # [...]
    case $1 in
        c)
            # Completion
            local expl
            local -a dirs
            dirs=($MARKPATH/*(N@:t)) # ❶
            dirs=("@"${^dirs}) # ❷
            _wanted dynamic-dirs expl 'bookmarked directory' compadd -S\] -a dirs
            return
            ;;
        # [...]
    esac
    # [...]
}
First, in ❶, we create a list of possible bookmarks. In *(N@:t), N@ is a glob qualifier: N makes the pattern expand to nothing if there is no match (otherwise, we would get an error) while @ only returns symbolic links. t is a modifier which will remove all leading pathname components. This is equivalent to using basename or ${something##*/} in POSIX shells but it plays nice with glob expressions. In ❷, we just add @ before each bookmark name. If we have b1, b2 and b3 as bookmarks, ${^dirs} expands to {b1,b2,b3} and therefore "@"${^dirs} expands to the (@b1 @b2 @b3) array. The result is then fed into the completion system.

Prompt expansion Many people put the name of the current directory in their prompt. It would be nice to have the bookmark name instead of the full name when we are below a bookmarked directory. That s also possible!
$ pwd
/home/bernat/code/deezer/lldpd/src/lib
$ echo ${(%):-%~}
~[@lldpd]/src/lib
The prompt expansion system calls the hook function with d as first argument and the file name to transform.
_bookmark_directory_name() {
    # [...]
    case $1 in
        d)
            local link slink
            local -A links
            for link ($MARKPATH/*(N@)) {
                links[${#link:A}$'\0'${link:A}]=${link:t} # ❶
            }
            for slink (${(@On)${(k)links}}) {
                link=${slink#*$'\0'} # ❷
                if [[ $2 = (#b)(${link})(|/*) ]]; then
                    typeset -ga reply
                    reply=("@"${links[$slink]} $(( ${#match[1]} )) )
                    return 0
                fi
            }
            return 1
            ;;
        # [...]
    esac
    # [...]
}
OK. This is some black Zsh wizardry. Feel free to skip the explanation. This is a bit complex because we want to substitute the most specific bookmark, hence the need to sort bookmarks by the length of their targets. In ❶, the associative array $links is created by iterating on each symbolic link ($link) in the $MARKPATH directory. The goal is to map a target directory to the matching bookmark name. However, we need to iterate on this map from the longest to the shortest key. To achieve that, we prepend each key with its length. Remember, ${link:A} is the absolute path with symbolic links resolved. So, ${#link:A} is the length of this path. We concatenate the length of the target directory with the target directory name and use $'\0' as a separator because this is the only safe character for this purpose. The result is mapped to the bookmark name. The second loop is an iteration on the keys of the associative array $links (thanks to the use of the k parameter flag in ${(k)links}). Those keys are turned into an array (@ parameter flag) and sorted numerically in descending order (On parameter flag). Since the keys are directory names prefixed by their lengths, the first match will be the longest one. In ❷, we extract the directory name from the key by removing the length and the null character at the beginning. Then, we check if the extracted directory name matches the file name we have been provided. Again, (#b) just activates backreferences. With extended globbing, we can use the or operator, |. So, when the file name either matches the directory name exactly or is somewhere deeper, we create the reply, which is an array whose first member is the bookmark name and whose second member is the length of the matched part of the file name.

Easy typing Typing ~[@ is cumbersome. Hopefully, Zsh line editor can be extended with additional bindings. The following snippet will substitute @@ (if typed without a pause) by ~[@:
vbe-insert-bookmark()  
    emulate -L zsh
    LBUFFER=$ LBUFFER "~[@"
 
zle -N vbe-insert-bookmark
bindkey '@@' vbe-insert-bookmark
In combination with the autocd option and completion, it is quite easy to jump to a bookmarked directory.
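For reference, this option is enabled with:
setopt autocd
With autocd set, typing the expanded directory name alone is enough to change into it.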

  1. For example:
    $ lldpd=/home/bernat/code/deezer/lldpd
    $ echo $lldpd/README.md
    /home/bernat/code/deezer/lldpd/README.md
    $ head -n1 $lldpd/README.md
    lldpd: implementation of IEEE 802.1ab (LLDP)
    
    The drawback is that you don't have a separate namespace for your bookmarks. You can still use a special prefix for that. Also, no prompt expansion.
  2. Other functions accepting hooks are chpwd() or precmd().

23 December 2014

Vincent Bernat: Eudyptula Challenge: superfast Linux kernel booting

The Eudyptula Challenge is a series of programming exercises for the Linux kernel, that start from a very basic Hello world kernel module, moving on up in complexity to getting patches accepted into the main Linux kernel source tree.
One of the first tasks of this quite interesting challenge is to compile and boot your own kernel. eudyptula-boot is a self-contained shell script to boot any kernel image to a shell. It is packed with the following features: In the following video, eudyptula-boot is used to boot the host kernel and execute a few commands:
In the next one, we use it to boot a custom kernel with an additional system call. This is the fifteenth task of the Eudyptula Challenge. A test program is used to check that the system call is working as expected. Additionally, we demonstrate how to attach a debugger to the running kernel.
While this hack could be used to run containers3 with an increased isolation, the performance of the 9p filesystem is unfortunately quite poor.

  1. The only requirement is to have 9p virtio support enabled. This can easily be enabled with make kvmconfig.
  2. Only udev is started.
  3. A good way to start a container is to combine --root, --force and --exec parameters. Add --readwrite to the mix if you want to keep the modifications.

18 November 2014

Erich Schubert: Generate iptables rules via pyroman

Vincent Bernat blogged on using Netfilter rulesets, pointing out that inserting the rules one by one using iptables calls may leave your firewall temporarily incomplete, possibly half-working, and that this approach can be slow.
He's right with that, but there are tools that do this properly. ;-)
Some years ago, for a multi-homed firewall, I wrote a tool called Pyroman. Using rules specified either in Python or XML syntax, it generates a firewall ruleset for you.
But it also addresses the points Vincent raised:
  • It uses iptables-restore to load the firewall more efficiently than by calling iptables a hundred times
  • It will backup the previous firewall, and roll-back on errors (or lack of confirmation, if you are remote and use --safe)
It also has a nice feature for use in staging: it can generate firewall rule sets offline, to let you review them before use, or transfer them to a different host. Not all functionality is supported though (e.g. the Firewall.hostname constant usable in Python conditionals will still be the name of the host you generate the rules on - you may want to add a --hostname parameter to pyroman).
pyroman --print-verbose will generate a script readable by iptables-restore except for one problem: it contains both the rules for IPv4 and for IPv6, separated by #### IPv6 rules. It will also annotate the origin of the rule, for example:
# /etc/pyroman/02_icmpv6.py:82
-A rfc4890f -p icmpv6 --icmpv6-type 255 -j DROP
indicates that this particular line was produced due to line 82 in file /etc/pyroman/02_icmpv6.py. This makes debugging easier. In particular it allows pyroman to produce a meaningful error message if the rules are rejected by the kernel: it will tell you which line caused the rule that was rejected.
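In the meantime, the combined output can be split by hand at the documented separator; a hypothetical sketch (iptables-restore treats lines starting with # as comments, so the annotations and the separator line itself are harmless):
$ pyroman --print-verbose > /tmp/rules
$ sed -n '1,/^#### IPv6 rules/p' /tmp/rules | iptables-restore
$ sed -n '/^#### IPv6 rules/,$p' /tmp/rules | ip6tables-restore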
For the next version, I will probably add --output-ipv4 and --output-ipv6 options to make this more convenient to use. So far, pyroman is meant to be used on the firewall itself.
Note: if you have configured a firewall that you are happy with, you can always use iptables-save to dump the current firewall. But it will not preserve comments, obviously.
